History log of /freebsd-11-stable/sys/netinet/ip_reass.c
Revision	Date	Author	Comments
# 341260	29-Nov-2018	markj	MFC r340483 (by jtl): Add some additional length checks to the IPv4 fragmentation code.
# 337802	14-Aug-2018	jtl	MFC r337786: Lower the default limits on the IPv4 reassembly queue. In particular, try to ensure that no bucket will have a reassembly queue larger than approximately 100 items. This limits the cost to find the correct reassembly queue when processing an incoming fragment. Due to the low limits on each bucket's length, increase the size of the hash table from 64 to 1024. Approved by: so Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923
# 337796	14-Aug-2018	jtl	MFC r337780: Implement a limit on the number of IPv4 reassembly queues per bucket. There is a hashing algorithm which should distribute IPv4 reassembly queues across the available buckets in a relatively even way. However, if there is a flaw in the hashing algorithm which allows a large number of IPv4 fragment reassembly queues to end up in a single bucket, a per-bucket limit could help mitigate the performance impact of this flaw. Implement such a limit, with a default of twice the maximum number of reassembly queues divided by the number of buckets. Recalculate the limit any time the maximum number of reassembly queues changes. However, allow the user to override the value using a sysctl (net.inet.ip.maxfragbucketsize). Approved by: so Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923
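The default described in r337780 (twice the maximum number of reassembly queues divided by the number of buckets) can be sketched as plain arithmetic. This is a userspace illustration only: the function name default_bucket_limit and its parameters are hypothetical, and the rounding and any minimum value applied by the kernel code are not stated in the log, so plain integer division is assumed here.

```c
#include <assert.h>

/*
 * Hypothetical helper (not the kernel's code): default per-bucket limit,
 * computed as 2 * max_queues / nbuckets with integer division.
 * Assumes max_queues >= 0 and nbuckets > 0.
 */
static int
default_bucket_limit(int max_queues, int nbuckets)
{
	assert(max_queues >= 0);
	assert(nbuckets > 0);
	/* Widen before multiplying so 2 * max_queues cannot overflow int. */
	return ((int)((2LL * max_queues) / nbuckets));
}
```

With 1024 queues and 64 buckets this gives 32, so a perfectly even distribution would fill each bucket to half of its limit.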
# 337795	14-Aug-2018	jtl	MFC r337778: Add a global limit on the number of IPv4 fragments. The IP reassembly fragment limit is based on the number of mbuf clusters, which are a global resource. However, the limit is currently applied on a per-VNET basis. Given enough VNETs (or given sufficient customization of enough VNETs), it is possible that the sum of all the VNET limits will exceed the number of mbuf clusters available in the system. Given the fact that the fragment limit is intended (at least in part) to regulate access to a global resource, the fragment limit should be applied on a global basis. VNET-specific limits can be adjusted by modifying the net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket sysctls. To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0. To disable fragment reassembly for a particular VNET, set net.inet.ip.maxfragpackets to 0. Approved by: so Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923
# 337789	14-Aug-2018	jtl	MFC r337775: Improve hashing of IPv4 fragments. Currently, IPv4 fragments are hashed into buckets based on a 32-bit key which is calculated by (src_ip ^ ip_id) and combined with a random seed. However, because an attacker can control the values of src_ip and ip_id, it is possible to construct an attack which causes very deep chains to form in a given bucket. To ensure more uniform distribution (and lower predictability for an attacker), calculate the hash based on a key which includes all the fields we use to identify a reassembly queue (dst_ip, src_ip, ip_id, and the ip protocol) as well as a random seed. Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923
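The idea in r337775, hashing every field that identifies a reassembly queue together with a random seed, can be sketched in userspace as follows. Assumptions: the struct frag_key layout, the function names, and the use of Jenkins' one-at-a-time hash are illustrative stand-ins and are not taken from ip_reass.c; the bucket count is assumed to be a power of two so the hash can be masked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key: the fields the log names as identifying a queue. */
struct frag_key {
	uint32_t dst;	/* destination address */
	uint32_t src;	/* source address */
	uint16_t id;	/* IP ID */
	uint8_t  proto;	/* IP protocol */
};

/* Jenkins one-at-a-time hash, started from a caller-supplied seed. */
static uint32_t
oaat_hash(const uint8_t *p, size_t len, uint32_t seed)
{
	uint32_t h = seed;

	for (size_t i = 0; i < len; i++) {
		h += p[i];
		h += h << 10;
		h ^= h >> 6;
	}
	h += h << 3;
	h ^= h >> 11;
	h += h << 15;
	return (h);
}

/* Map a key to a bucket index; nbuckets must be a power of two. */
static uint32_t
frag_bucket(const struct frag_key *k, uint32_t seed, uint32_t nbuckets)
{
	uint8_t buf[11];

	assert(nbuckets != 0 && (nbuckets & (nbuckets - 1)) == 0);
	/* Serialize field by field so struct padding never enters the hash. */
	memcpy(buf, &k->dst, 4);
	memcpy(buf + 4, &k->src, 4);
	memcpy(buf + 8, &k->id, 2);
	buf[10] = k->proto;
	return (oaat_hash(buf, sizeof(buf), seed) & (nbuckets - 1));
}
```

Because the seed is chosen at random and mixed into every byte of the hash, an attacker who controls src and id cannot predict which bucket a given fragment lands in without knowing the seed.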
# 330302	03-Mar-2018	np	MFC r328314: Do not generate illegal mbuf chains during IP fragment reassembly. Only the first mbuf of the reassembled datagram should have a pkthdr.
# 302408	07-Jul-2016	gjb	Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle. Prune svn:mergeinfo from the new branch, as nothing has been merged here. Additional commits post-branch will follow. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
# 281541	14-Apr-2015	adrian	Fix RSS build - netisr input / NETISR_IP_DIRECT is used here.
# 281352	10-Apr-2015	glebius	o Use Jenkins hash. With previous hash, for a single source IP address and sequential IP ID case (e.g. ping -f), distribution fell into 8-10 buckets out of 64. With Jenkins hash, distribution is even. o Add random seed to the hash. Sponsored by: Nginx, Inc.
# 281351	10-Apr-2015	glebius	Move all code related to IP fragment reassembly to ip_reass.c. Some function names have changed and comments are reformatted or added, but there is no functional change. Claim copyright for me and Adrian. Sponsored by: Nginx, Inc.
# 281342	09-Apr-2015	glebius	Now that IP reassembly is no longer under single lock, book-keeping amount of allocations in V_nipq is racy. To fix that, we would simply stop doing book-keeping ourselves, and rely on UMA doing that. There could be a slight overcommit due to caches, but that isn't a big deal. o V_nipq and V_maxnipq go away. o net.inet.ip.fragpackets is now just SYSCTL_UMA_CUR() o net.inet.ip.maxfragpackets could have been just SYSCTL_UMA_MAX(), but historically it has special semantics about values of 0 and -1, so provide sysctl_maxfragpackets() to handle these special cases. o If zone limit lowers either due to net.inet.ip.maxfragpackets or due to kern.ipc.nmbclusters, then new function ipq_drain_tomax() goes over buckets and frees the oldest packets until we are in the limit. The code that (incorrectly) did that in ip_slowtimo() is removed. o ip_reass() doesn't check any limits and calls uma_zalloc(M_NOWAIT). If it fails, a new function ipq_reuse() is called. This function will find the oldest packet in the currently locked bucket, and if there is none, it will search in other buckets until success. Sponsored by: Nginx, Inc.
# 281334	09-Apr-2015	glebius	In the ip_reass() do packet examination and adjusting before acquiring locks and doing lookups. Sponsored by: Nginx, Inc.
# 281329	09-Apr-2015	glebius	Make ip reassembly queue mutexes per-vnet, putting them into the structure that they protect. Sponsored by: Nginx, Inc.
# 281296	09-Apr-2015	glebius	Use TAILQ_FOREACH_SAFE() instead of implementing it ourselves. Sponsored by: Nginx, Inc.
# 281295	09-Apr-2015	glebius	If V_maxnipq is set to zero, drain the queue here and now, instead of relying on timeouts. Sponsored by: Nginx, Inc.
# 281294	09-Apr-2015	glebius	o Since we always update either fragdrop or fragtimeout stat counter when we free a fragment, provide two inline functions that do that for us: ipq_drop() and ipq_timeout(). o Rename ip_free_f() to ipq_free() to match the name scheme of IP reassembly. o Remove assertion from ipq_free(), since it requires extra argument to be passed, but locking scheme is simple enough and function is static. Sponsored by: Nginx, Inc.
# 281293	09-Apr-2015	glebius	Rename ip_drain_locked() to ip_drain_vnet(), since the function differs from ip_drain() not in locking, but in the scope of its work. Sponsored by: Nginx, Inc.
# 281239	07-Apr-2015	adrian	Move the IPv4 reassembly queue locking from a single lock to be per-bucket (global). This significantly improves performance on multi-core servers where there is any kind of IPv4 reassembly going on. glebius@ would like to see the locking moved to be attached to the reassembly bucket, which would make it per-bucket + per-VNET, instead of being global. I decided to keep it global for now as it's the minimal useful change; if people agree / wish to migrate it to be per-bucket / per-VNET then please do feel free to do so. I won't complain. Thanks to Norse Corp for giving me access to much larger servers to test this at across the 4 core boxes I have at home. Differential Revision: https://reviews.freebsd.org/D2095 Reviewed by: glebius (initial comments incorporated into this patch) MFC after: 2 weeks Sponsored by: Norse Corp, Inc (hardware)
# 280971	01-Apr-2015	glebius	o Use new function ip_fillid() in all places throughout the kernel, where we want to create a new IP datagram. o Add support for RFC6864, which allows to set IP ID for atomic IP datagrams to any value, to improve performance. The behaviour is controlled by net.inet.ip.rfc6864 sysctl knob, which is enabled by default. o In case if we generate IP ID, use counter(9) to improve performance. o Gather all code related to IP ID into ip_id.c. Differential Revision: https://reviews.freebsd.org/D2177 Reviewed by: adrian, cy, rpaulo Tested by: Emeric POUPON <emeric.poupon stormshield.eu> Sponsored by: Netflix Sponsored by: Nginx, Inc. Relnotes: yes
# 277331	18-Jan-2015	adrian	Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific bits. The motivation here is to eventually teach netisr and potentially other networking subsystems a bit more about how RSS work queues / buckets are configured so things have a hope of auto-configuring in the future. * net/rss_config.[ch] takes care of the generic bits for doing configuration, hash function selection, etc; * toeplitz.[ch] is now in net/ rather than netinet/; * (and would be in libkern if it didn't directly include RSS_KEYSIZE; that's a later thing to fix up.) * netinet/in_rss.[ch] now just contains the IPv4 specific methods; * and netinet/in6_rss.[ch] now just contains the IPv6 specific methods. This should have no functional impact on anyone currently using the RSS support. Differential Revision: D1383 Reviewed by: gnn, jfv (intel driver bits)
# 275704	11-Dec-2014	ae	Move ip_ipsec_fwd() from ip_input() into ip_forward(). Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is already handled by IPSEC code. This means that before IPSEC processing it was destined to our address and security policy was checked in the ip_ipsec_input(). After IPSEC processing packet has new IP addresses and destination address isn't our own. So, anyway we can't check security policy from the mbuf tag, because it corresponds to different addresses. We should check security policy that corresponds to packet attributes in both cases - when it has a mbuf tag and when it has not. Obtained from: Yandex LLC Sponsored by: Yandex LLC
# 275703	11-Dec-2014	ae	Remove PACKET_TAG_IPSEC_IN_DONE mbuf tag lookup and usage of its security policy. The changed block of code in ip*_ipsec_input() is called when packet has ESP/AH header. Presence of PACKET_TAG_IPSEC_IN_DONE mbuf tag in the same time means that packet was already handled by IPSEC and reinjected in the netisr, and it has another ESP/AH headers (encrypted twice?). Since it was already processed by IPSEC code, the AH/ESP headers was already stripped (and probably outer IP header was stripped too) and security policy from the tdb_ident was applied to those headers. It is incorrect to apply this security policy to current headers. Also make ip_ipsec_input() prototype similar to ip6_ipsec_input(). Obtained from: Yandex LLC Sponsored by: Yandex LLC
# 275358	01-Dec-2014	hselasky	Start process of removing the use of the deprecated "M_FLOWID" flag from the FreeBSD network code. The flag is still kept around in the "sys/mbuf.h" header file, but no longer has any users. Instead the "m_pkthdr.rsstype" field in the mbuf structure is now used to decide the meaning of the "m_pkthdr.flowid" field. To modify the "m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX" macros as defined in the "sys/mbuf.h" header file. This patch introduces new behaviour in the transmit direction. Previously network drivers checked if "M_FLOWID" was set in "m_flags" before using the "m_pkthdr.flowid" field. This check has now been replaced by checking if "M_HASHTYPE_GET(m)" is different from "M_HASHTYPE_NONE". In the future more hashtypes will be added, for example hashtypes for hardware dedicated flows. "M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is valid and has no particular type. This change removes the need for an "if" statement in TCP transmit code checking for the presence of a valid flowid value. The "if" statement mentioned above is now a direct variable assignment which is then later checked by the respective network drivers like before. Additional notes: - The SCTP code changes will be committed as a separate patch. - Removal of the "M_FLOWID" flag will also be done separately. - The FreeBSD version has been bumped. MFC after: 1 month Sponsored by: Mellanox Technologies
# 274363	11-Nov-2014	melifaro	Kill custom in_matroute() radix matching function removing one rte mutex lock. Initially in_matroute() and in_clsroute() in their current state were introduced by r4105 20 years ago. Instead of deleting inactive routes immediately, we kept them in route table, setting RTPRF_OURS flag and some expire time. After that, either GC came or RTPRF_OURS got removed on first-packet. It was a good solution in those days (and probably another decade after that) to keep TCP metrics. However, after moving metrics to TCP hostcache in r122922, most of in_rmx functionality became unused. It might have been used for flushing icmp-originated routes before rte mutexes/refcounting, but I'm not sure about that. So it looks like this is nearly impossible to make GC do its work nowadays: in_rtkill() ignores non-RTPRF_OURS routes. route can only become RTPRF_OURS after dropping last reference via rtfree() which calls in_clsroute(), which, in turn, ignores UP and non-RTF_DYNAMIC routes. Dynamic routes can still be installed via received redirect, but they have default lifetime (no specific rt_expire) and no one has another trie walker to call RTFREE() on them. So, the changelist: * remove custom rnh_match / rnh_close matching function. * remove all GC functions * partially revert r256695 (proto3 is no more used inside kernel, it is not possible to use rt_expire from user point of view, proto3 support is not complete) * Finish r241884 (similar to this commit) and remove remaining IPv6 parts MFC after: 1 month
# 274359	10-Nov-2014	melifaro	Remove kernel handling of ICMP_SOURCEQUENCH. It hasn't been used for a very long time. Additionally, it was deprecated by RFC 6633.
# 274331	09-Nov-2014	melifaro	Remove faith(4) and faithd(8) from base. It looks like industry has chosen different (and more traditional) stateless/stateful NAT64 as translation mechanism. Last non-trivial commits to both faith(4) and faithd(8) happened more than 12 years ago, so I assume it is time to drop RFC3142 in FreeBSD. No objections from: net@
# 274225	07-Nov-2014	glebius	Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed. Sponsored by: Nginx, Inc.
# 272199	27-Sep-2014	adrian	Remove an un-needed bit of pre-processor work - it all lives inside #ifdef RSS.
# 271300	09-Sep-2014	adrian	Update the IPv4 input path to handle reassembled frames and incoming frames with no RSS hash. When doing RSS: * Create a new IPv4 netisr which expects the frames to have been verified; it just directly dispatches to the IPv4 input path. * Once IPv4 reassembly is done, re-calculate the RSS hash with the new IP and L3 header; then reinject it as appropriate. * Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash function (rss_soft_m2cpuid) - this will do a software hash if the hardware doesn't provide one. NICs that don't implement hardware RSS hashing will now benefit from RSS distribution - it'll inject into the correct destination netisr. Note: the netisr distribution doesn't work out of the box - netisr doesn't query RSS for how many CPUs and the affinity setup. Yes, netisr likely shouldn't really be doing CPU stuff anymore and should be "some kind of 'thing' that is a workqueue that may or may not have any CPU affinity"; that's for a later commit. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan
# 271293	08-Sep-2014	adrian	Add support for receiving and setting flowtype, flowid and RSS bucket information as part of recvmsg(). This is primarily used for debugging/verification of the various processing paths in the IP, PCB and driver layers. Unfortunately the current implementation of the control message path results in a ~10% or so drop in UDP frame throughput when it's used. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan
# 269699	08-Aug-2014	kevlo	Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have only one protocol switch structure that is shared between ipv4 and ipv6. Phabric: D476 Reviewed by: jhb
# 265942	13-May-2014	yongari	Fix checksum computation. Previously it didn't include carry. Reviewed by: tuexen
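The log for r265942 does not say which checksum routine was fixed, so the following is a generic illustration of why the carry matters in a one's-complement Internet checksum (in the style of RFC 1071), not the committed change. The function name inet_cksum is hypothetical, and the 32-bit accumulator is assumed to be large enough, which holds for buffers up to roughly 128 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Generic one's-complement checksum over big-endian 16-bit words.
 * The carries out of the low 16 bits are folded back in (end-around
 * carry); dropping them is the kind of bug the commit message describes.
 */
static uint16_t
inet_cksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	assert(data != NULL || len == 0);
	while (len > 1) {
		sum += (uint32_t)data[0] << 8 | data[1];
		data += 2;
		len -= 2;
	}
	if (len == 1)			/* odd trailing byte, zero-padded */
		sum += (uint32_t)data[0] << 8;
	while (sum >> 16)		/* fold the carry back into the sum */
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}
```

For the bytes ff ff 00 01 the raw sum is 0x10000. Folding the carry gives 0x0001 and a checksum of 0xfffe, whereas truncating to 16 bits without the carry would give 0xffff.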
# 263091	12-Mar-2014	glebius	Since both netinet/ and netinet6/ call into netipsec/ and netpfil/, the protocol specific mbuf flags are shared between them. - Move all M_FOO definitions into a single place: netinet/in6.h, to avoid future clashes. - Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted in a failure of operation of IPSEC and packet filters. Thanks to Nicolas and Georgios for all the hard work on bisecting, testing and finally finding the root of the problem. PR: kern/186755 PR: kern/185876 In collaboration with: Georgios Amanakis <gamanakis gmail.com> In collaboration with: Nicolas DEFFAYET <nicolas-ml deffayet.com> Sponsored by: Nginx, Inc.
# 262763	04-Mar-2014	glebius	- Remove rt_metrics_lite and simply put its members into rtentry. - Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache thrashing ++ from packet forwarding path. - Create zini/fini methods for the rtentry UMA zone. Initialize the mutex and counter via them. - Fix reporting of rmx_pksent to routing socket. - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode. The change is mostly targeted for stable/10 merge. For head, rt_pksent is expected to just disappear. Discussed with: melifaro Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 261601	07-Feb-2014	glebius	o Revamp API between flowtable and netinet, netinet6. - ip_output() and ip_output6() simply call flowtable_lookup(), passing mbuf and address family. That's the only code under #ifdef FLOWTABLE in the protocols code now. o Revamp statistics gathering and export. - Remove hand made pcpu stats, and utilize counter(9). - Snapshot of statistics is available via 'netstat -rs'. - All sysctls are moved into net.flowtable namespace, since spreading them over net.inet isn't correct. o Properly separate at compile time INET and INET6 parts. o General cleanup. - Remove chain of multiple flowtables. We simply have one for IPv4 and one for IPv6. - Flowtables are allocated in flowtable.c, symbols are static. - With proper argument to SYSINIT() we no longer need flowtable_ready. - Hash salt doesn't need to be per-VNET. - Removed rudimentary debugging, which is quite useless in the dtrace era. The runtime behavior of flowtable shouldn't be changed by this commit. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 258541	25-Nov-2013	attilio	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invocation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip
# 256525	15-Oct-2013	glebius	- Utilize counter(9) to accumulate statistics on interface addresses. Add four counters to struct ifaddr. This kills '+=' on variables shared between processors for every packet. - Nuke struct if_data from struct ifaddr. - In ip_input() do not put a reference on ifaddr, instead update statistics right now in place and do IN_IFADDR_RUNLOCK(). This removes atomic(9) for every packet. [1] - To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in rtsock.c fill if_data fields using counter_u64_fetch(). - Accidentally fix bug in COMPAT_32 version of NET_RT_IFLISTL, which took if_data not from the ifaddr, but from ifaddr's ifnet. [2] Submitted by: melifaro [1], pluknet[2] Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 255523	13-Sep-2013	trociny	Unregister inet/inet6 pfil hooks on vnet destroy. Discussed with: andre Approved by: re (rodrigc)
# 254889	25-Aug-2013	markj	Implement the ip, tcp, and udp DTrace providers. The probe definitions use dynamic translation so that their arguments match the definitions for these providers in Solaris and illumos. Thus, existing scripts for these providers should work unmodified on FreeBSD. Tested by: gnn, hiren MFC after: 1 month
# 254804	24-Aug-2013	andre	Restructure the mbuf pkthdr to make it fit for upcoming capabilities and features. The changes in particular are: o Remove rarely used "header" pointer and replace it with a 64bit protocol/ layer specific union PH_loc for local use. Protocols can flexibly overlay their own 8 to 64 bit fields to store information while the packet is worked on. o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc instead of pkthdr.header. o Extend csum_flags to 64bits to allow for additional future offload information to be carried (e.g. iSCSI, IPsec offload, and others). o Move the RSS hash type enumerator from abusing m_flags to its own 8bit rsstype field. Adjust accessor macros. o Add cosqos field to store Class of Service / Quality of Service information with the packet. It is not yet supported in any drivers but allows us to get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with a modernized ALTQ. o Add four 8 bit fields l[2-5]hlen to store the relative header offsets from the start of the packet. This is important for various offload capabilities and to relieve the drivers from having to parse the packet and protocol headers to find out location of checksums and other information. Header parsing in drivers is a lot of copy-paste and unhandled corner cases which we want to avoid. o Add another flexible 64bit union to map various additional persistent packet information, like ether_vtag, tso_segsz and csum fields. Depending on the csum_flags settings some fields may have different usage making it very flexible and adaptable to future capabilities. o Restructure the CSUM flags to better signify their outbound (down the stack) and inbound (up the stack) use. The CSUM flags used to be a bit chaotic and rather poorly documented leading to incorrect use in many places. Bring clarity into their use through better naming. Compatibility mappings are provided to preserve the API. The drivers can be corrected one by one and MFC'd without issue. o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures). Sponsored by: The FreeBSD Foundation
# 254518	19-Aug-2013	andre	Move ip_reassemble()'s use of the global M_FRAG mbuf flag to a protocol layer specific flag instead. The flag is only relevant while the packet stays in the IP reassembly queue. Discussed with: trociny, glebius
# 253083	09-Jul-2013	ae	Use new macros to implement ipstat and tcpstat using PCPU counters. Change interface of kread_counters() to be similar to kread() in netstat(1).
# 252055	21-Jun-2013	glebius	Fix kmod_*stat_inc() after r249276. The incorrect code actually increased the pointer, not the memory it points to. In collaboration with: kib Reported & tested by: Ian FREISLICH <ianf clue.co.za> Sponsored by: Nginx, Inc.
# 250300	06-May-2013	andre	Back out r249318, r249320 and r249327 due to a heisenbug most likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.
# 249318	09-Apr-2013	andre	Change certain heavily used network related mutexes and rwlocks to reside on their own cache line to prevent false sharing with other nearby structures, especially for those in the .bss segment. NB: Those mutexes and rwlocks with variables next to them that get changed on every invocation do not benefit from their own cache line. Actually it may be net negative because two cache misses would be incurred in those cases.
# 249276	08-Apr-2013	glebius	Merge from projects/counters: TCP/IP stats. Convert 'struct ipstat' and 'struct tcpstat' to counter(9). This speeds up IP forwarding at extreme packet rates, and makes accounting more precise. Sponsored by: Nginx, Inc.
# 248324	15-Mar-2013	glebius	Use m_get/m_gethdr instead of compat macros. Sponsored by: Nginx, Inc.
# 247044	20-Feb-2013	pluknet	ip_savecontrol() style fixes. No functional changes. - fix indentation - put the operator at the end of the line for long statements - remove spaces between the type and the variable in a cast - remove excessive parentheses Tested by: md5
# 243882	05-Dec-2012	glebius	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually
# 242463	01-Nov-2012	ae	Remove the recently added sysctl variable net.pfil.forward. Instead, add protocol specific mbuf flags M_IP_NEXTHOP and M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup only when this flag is set. Suggested by: andre
# 242079	25-Oct-2012	ae	Remove the IPFIREWALL_FORWARD kernel option and make possible to turn on the related functionality in the runtime via the sysctl variable net.pfil.forward. It is turned off by default. Sponsored by: Yandex LLC Discussed with: net@ MFC after: 2 weeks
# 242077	25-Oct-2012	glebius	After r241923 the updated ip_len no longer needed.
# 242076	25-Oct-2012	glebius	Fix error in r241913 that had broken fragment reassembly.
# 241923	23-Oct-2012	glebius	Do not reduce ip_len by size of IP header in the ip_input() before passing a packet to protocol input routines. For several protocols this mean that now protocol needs to do subtraction itself, and for another half this means that we do not need to add header length back to the packet. Make ip_stripoptions() to adjust ip_len, since now we enter this function with a packet header whose ip_len does represent length of entire packet, not payload only.
# 241913	22-Oct-2012	glebius	Switch the entire IPv4 stack to keep the IP packet header in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before passing a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtracts header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
# 241245	06-Oct-2012	glebius	A step in resolving mess with byte ordering for AF_INET. After this change: - All packets in NETISR_IP queue are in net byte order. - ip_input() is entered in net byte order and converts packet to host byte order right _after_ processing pfil(9) hooks. - ip_output() is entered in host byte order and converts packet to net byte order right _before_ processing pfil(9) hooks. - ip_fragment() accepts and emits packet in net byte order. - ip_forward(), ip_mloopback() use host byte order (untouched actually). - ip_fastforward() no longer modifies packet at all (except ip_ttl). - Swapping of byte order there and back removed from the following modules: pf(4), ipfw(4), enc(4), if_bridge(4). - Swapping of byte order added to ipfilter(4), based on __FreeBSD_version - __FreeBSD_version bumped. - pfil(9) manual page updated. Reviewed by: ray, luigi, eri, melifaro Tested by: glebius (LE), ray (BE)
# 238092	04-Jul-2012	glebius	When ip_output()/ip6_output() is supplied a struct route *ro argument, it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning here: it may be supplied to provide route, and it may be supplied to store and return to caller the route that ip_output()/ip6_output() finds. In the latter case skipping FLOWTABLE lookup is a pessimisation. The difference between struct route filled by FLOWTABLE and filled by rtalloc() family is that the former doesn't hold a reference on its rtentry. The reference is held by the flow entry, and it will be released in the future. Thus, route filled by FLOWTABLE shouldn't be passed to RTFREE() macro. - Introduce new flag for struct route/route_in6, that marks route not holding a reference on rtentry. - Introduce new macro RO_RTFREE() that cleans up a struct route depending on its kind. - All callers to ip_output()/ip6_output() that do supply non-NULL but empty route should use RO_RTFREE() to free results of lookup. - ip_output()/ip6_output() now do FLOWTABLE lookup always when ro->ro_rt == NULL. Tested by: tuexen (SCTP part)
# 236959	12-Jun-2012	tuexen	Add an IP_RECVTOS socket option to receive, for received UDP/IPv4 packets, a cmsg of type IP_RECVTOS which contains the TOS byte. Much like IP_RECVTTL does for TTL. This allows implementing a protocol on top of UDP that implements ECN. MFC after: 3 days
# 229621	05-Jan-2012	jhb	Convert all users of IF_ADDR_LOCK to use new locking macros that specify either a read lock or write lock. Reviewed by: bz MFC after: 2 weeks
# 226401	15-Oct-2011	glebius	Remove last remnants of classful addressing: - Remove ia_net, ia_netmask, ia_netbroadcast from struct in_ifaddr. - Remove net.inet.ip.subnetsarelocal, I bet no one needs it in 2011. - Fix bug when we were not forwarding to a host which matches classful net address. For example router having 192.168.x.y/16 network attached, would not forward traffic to 192.168.*.0, which are legal IPs in CIDR world. - For compatibility, leave autoguessing of mask based on class. Reviewed by: andre, bz, rwatson
# 222845	08-Jun-2011	bz	Correct comments and debug logging in ipsec to better match reality. MFC after: 3 days
# 221131	27-Apr-2011	bz	MfP4 CH=192004: Move ip_defttl to raw_ip.c where it is actually used. In an IPv6 only world we do not want to compile ip_input.c in for that and it is a shared default with INET6. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days
# 220879	20-Apr-2011	bz	MFp4 CH=191470: Move the ipport_tick_callout and related functions from ip_input.c to in_pcb.c. The random source port allocation code has been merged and is now local to in_pcb.c only. Use a SYSINIT to get the callout started and no longer depend on initialization from the inet code, which would not work in an IPv6 only setup. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days
# 220878	20-Apr-2011	bz	MFp4 CH=191466: Move fw_one_pass to where it belongs: it is a property of ipfw, not of ip_input. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 3 days
# 218909	21-Feb-2011	brucec	Fix typos - remove duplicate "the". PR: bin/154928 Submitted by: Eitan Adler <lists at eitanadler.com> MFC after: 3 days
# 215701	22-Nov-2010	dim	After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 \| dim \| 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) \| 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 \| dim \| 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) \| 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 \| dim \| 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) \| 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
# 215317	14-Nov-2010	dim	Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
# 212155	02-Sep-2010	bz	MFp4 CH=183052 183053 183258: In protosw we define pr_protocol as short, while on the wire it is an uint8_t. That way we can have "internal" protocols like DIVERT, SEND or gaps for modules (PROTO_SPACER). Switch ipproto_{un,}register to accept a short protocol number(*) and do an upfront check for valid boundaries. With this we also consistently report EPROTONOSUPPORT for out of bounds protocols, as we did for proto == 0. This allows a caller to not error for this case, which is especially important if we want to automatically call these from domain handling. (*) the functions have been without any in-tree consumer since the initial introduction, so this is considered safe. Implement ip6proto_{un,}register() similarly to their legacy IP counterparts to allow modules to hook up dynamically. Reviewed by: philip, will MFC after: 1 week
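The upfront boundary check described in r212155 can be sketched as follows. This is an illustration only: `sketch_proto_check()` and `SKETCH_PROTO_MAX` are hypothetical names, and the upper bound of 256 is an assumption derived from the statement that the on-wire protocol field is a uint8_t. Only the use of EPROTONOSUPPORT for both `proto == 0` and out-of-bounds values comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed bound: one past the largest value an on-wire uint8_t can carry. */
#define	SKETCH_PROTO_MAX	256

static_assert(SKETCH_PROTO_MAX == UINT8_MAX + 1,
    "bound must match the on-wire uint8_t range");

/*
 * Hypothetical validity check run before any registration work:
 * returns 0 for an acceptable protocol number, EPROTONOSUPPORT otherwise.
 */
static int
sketch_proto_check(short proto)
{
	if (proto <= 0 || proto >= SKETCH_PROTO_MAX)
		return (EPROTONOSUPPORT);
	return (0);
}
```

Because every rejected value yields the same error code, a caller that registers protocols automatically can treat EPROTONOSUPPORT as "skip this entry" rather than as a hard failure.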
# 211157	10-Aug-2010	will	Allow carp(4) to be loaded as a kernel module. Follow precedent set by bridge(4), lagg(4) etc. and make use of function pointers and pf_proto_register() to hook carp into the network stack. Currently, because of the uncertainty about whether the unload path is free of race condition panics, unloads are disallowed by default. Compiling with CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure. This commit requires IP6PROTOSPACER, introduced in r211115. Reviewed by: bz, simon Approved by: ken (mentor) MFC after: 2 weeks
# 207369	29-Apr-2010	bz	MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitespace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOBALS (before r185088) and try to reduce the diff to stable/7 and earlier as much as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH MFC after: 6 days
# 206989	21-Apr-2010	bz	Avoid memory access after free. Use the (shortened) copy for the ipsec mtu lookup as well. PR: kern/145736 Submitted by: Peter Molnar (peter molnar.cc) MFC after: 3 days
# 205488	22-Mar-2010	kmacy	- boot-time size the ipv4 flowtable and the maximum number of flows - increase flow cleaning frequency and decrease flow caching time when near the flow limit - stop allocating new flows when within 3% of maxflows, and don't start allocating again until below 12.5% MFC after: 7 days
# 205066	12-Mar-2010	kmacy	- restructure flowtable to support ipv6 - add a name argument to flowtable_alloc for printing with ddb commands - extend ddb commands to print destination address or 4-tuples - don't parse ports in ulp header if FL_HASH_ALL is not passed - add kern_flowtable_insert to enable more generic use of flowtable (e.g. system calls for adding entries) - don't hash loopback addresses - cleanup whitespace - keep statistics per-cpu for per-cpu flowtables to avoid cache line contention - add sysctls to accumulate stats and report aggregate MFC after: 7 days
# 204140	20-Feb-2010	bz	Split up ip_drain() into an outer lock and iterator part and a "locked" version that will only handle a single network stack instance. The latter is called directly from ip_destroy(). Hook up an ip_destroy() function to release resources from the legacy IP network layer upon virtual network stack teardown. Sponsored by: ISPsystem Reviewed by: rwatson MFC After: 5 days
# 198438	24-Oct-2009	rwatson	Correct spelling typo in ip_input comment. Pointed out by: N.J. Mann <njm at njm.me.uk>, John Nielsen <john at jnielsen.net>, julian (!), lstewart MFC after: 2 days
# 198393	23-Oct-2009	rwatson	Improve grammar in ip_input comment while attempting to maintain what might be its meaning. MFC after: 3 days
# 198196	18-Oct-2009	rwatson	Rewrap ip_input() comment so that it prints more nicely. MFC after: 3 days
# 197952	11-Oct-2009	julian	Virtualize the pfil hooks so that different jails may choose different packet filters. Also allows ipfw to be enabled on one jail and disabled on another. In 8.0 it's a global setting. Sitting around in tree waiting to commit for: 2 months MFC after: 2 months
# 196039	02-Aug-2009	rwatson	Many network stack subsystems use a single global data structure to hold all pertinent statistics for the subsystem. These structures are sometimes "borrowed" by kernel modules that require a place to store statistics for similar events. Add KPI accessor functions for statistics structures referenced by kernel modules so that they no longer encode certain specifics of how the data structures are named and stored. This change is intended to make it easier to move to per-CPU network stats following 8.0-RELEASE. The following modules are affected by this change: if_bridge if_cxgb if_gif ip_mroute ipdivert pf In practice, most of these statistics consumers should, in fact, maintain their own statistics data structures rather than borrowing structures from the base network stack. However, that change is too aggressive for this point in the release cycle. Reviewed by: bz Approved by: re (kib)
# 196019	01-Aug-2009	rwatson	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# 195788	20-Jul-2009	rwatson	Back out the moving in r195782 of V_ip_id's initialization from the top back to the bottom of ip_init() as found in 7.x. I missed the fact that the bottom half of the init routine only runs in the !VNET case. Submitted by: zec Approved by: re (vimage blanket)
# 195782	20-Jul-2009	rwatson	Garbage collect vnet module registrations that have neither constructors nor destructors, as there's no actual work to do. In most cases, the constructors weren't needed because of the existing protocol initialization functions run by net_init_domain() as part of VNET_MOD_NET, or they were eliminated when support for static initialization of virtualized globals was added. Garbage collect dependency references to modules without constructors or destructors, notably VNET_MOD_INET and VNET_MOD_INET6. Reviewed by: bz Approved by: re (vimage blanket)
# 195760	19-Jul-2009	rwatson	Reimplement and/or implement vnet list locking by replacing a mostly unused custom mutex/condvar-based sleep locks with two locks: an rwlock (for non-sleeping use) and sxlock (for sleeping use). Either acquired for read is sufficient to stabilize the vnet list, but both must be acquired for write to modify the list. Replace previous no-op read locking macros, used in various places in the stack, with actual locking to prevent race conditions. Callers must declare when they may perform unbounded sleeps or not when selecting how to lock. Refactor vnet sysinits so that the vnet list and locks are initialized before kernel modules are linked, as the kernel linker will use them for modules loaded by the boot loader. Update various consumers of these KPIs based on whether they may sleep or not. Reviewed by: bz Approved by: re (kib)
# 195727	16-Jul-2009	rwatson	Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
# 195699	14-Jul-2009	rwatson	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 194962	25-Jun-2009	rwatson	Initialize in_ifaddr_lock using RW_SYSINIT() instead of in ip_init(), so that it doesn't run multiple times if VIMAGE is being used. Discussed with: bz MFC after: 6 weeks
# 194951	25-Jun-2009	rwatson	Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the in_ifaddrhead and INADDR_HASH address lists. Previously, these lists were used unsynchronized as they were effectively never changed in steady state, but we've seen increasing reports of writer-writer races on very busy VPN servers as core count has gone up (and similar configurations where address lists change frequently and concurrently). For the time being, use rwlocks rather than rmlocks in order to take advantage of their better lock debugging support. As a result, we don't enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion is complete and a performance analysis has been done. This means that one class of reader-writer races still exists. MFC after: 6 weeks Reviewed by: bz
# 194835	24-Jun-2009	rwatson	Clear 'ia' after iterating if_addrhead for unicast address matching: since 'ifa' was used as the TAILQ_FOREACH() iterator argument, and 'ia' was just derived from it, it could be left non-NULL which confused later conditional freeing code. This could cause kernel panics if multicast IP packets were received. [1] Call 'struct in_ifaddr *' in ip_rtaddr() 'ia', not 'ifa' in keeping with normal conventions. When 'ipstealth' is enabled and ip_input returns early, properly release the 'ia' reference. Reported by: lstewart, sam [1] MFC after: 6 weeks
# 194760	23-Jun-2009	rwatson	Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)
# 194660	22-Jun-2009	zec	V_irtualize flowtable state. This change should make options VIMAGE kernel builds usable again, to some extent at least. Note that the size of struct vnet_inet has changed, though in accordance with one-bump-per-day policy we didn't update the __FreeBSD_version number, given that it has already been touched by r194640 a few hours ago. Reviewed by: bz Approved by: julian (mentor)
# 194076	12-Jun-2009	bz	Move the kernel option FLOWTABLE checking from the header file to the actual implementation. Remove the accessor functions for the compiled out case, just returning "unavail" values. Remove the kernel conditional from the header file as it is no longer needed, only leaving the externs. Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size under the kernel option as well. Reviewed by: rwatson
# 193511	05-Jun-2009	rwatson	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# 193502	05-Jun-2009	luigi	More cleanup in preparation of ipfw relocation (no actual code change): + move ipfw and dummynet hooks declarations to raw_ip.c (definitions in ip_var.h) same as for most other global variables. This removes some dependencies from ip_input.c; + remove the IPFW_LOADED macro, just test ip_fw_chk_ptr directly; + remove the DUMMYNET_LOADED macro, just test ip_dn_io_ptr directly; + move ip_dn_ruledel_ptr to ip_fw2.c which is the only file using it; To be merged together with rev 193497 MFC after: 5 days
# 193219	01-Jun-2009	rwatson	Reimplement the netisr framework in order to support parallel netisr threads: - Support up to one netisr thread per CPU, each processing its own workstream, or set of per-protocol queues. Threads may be bound to specific CPUs, or allowed to migrate, based on a global policy. In the future it would be desirable to support topology-centric policies, such as "one netisr per package". - Allow each protocol to advertise an ordering policy, which can currently be one of: NETISR_POLICY_SOURCE: packets must maintain ordering with respect to an implicit or explicit source (such as an interface or socket). NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work, as well as allowing protocols to provide a flow generation function for mbufs without flow identifiers (m2flow). Falls back on NETISR_POLICY_SOURCE if no flow ID is available. NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for each packet handled by netisr (m2cpuid). - Provide utility functions for querying the number of workstreams being used, as well as a mapping function from workstream to CPU ID, which protocols may use in work placement decisions. - Add explicit interfaces to get and set per-protocol queue limits, and get and clear drop counters, which query data or apply changes across all workstreams. - Add a more extensible netisr registration interface, in which protocols declare 'struct netisr_handler' structures for each registered NETISR_ type. These include name, handler function, optional mbuf to flow ID function, optional mbuf to CPU ID function, queue limit, and ordering policy. Padding is present to allow these to be expanded in the future. If no queue limit is declared, then a default is used. - Queue limits are now per-workstream, and raised from the previous IFQ_MAXLEN default of 50 to 256. - All protocols are updated to use the new registration interface, and with the exception of netnatm, default queue limits.
Most protocols register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use NETISR_POLICY_FLOW, and will therefore take advantage of driver-generated flow IDs if present. - Formalize a non-packet based interface between interface polling and the netisr, rather than having polling pretend to be two protocols. Provide two explicit hooks in the netisr worker for start and end events for runs: netisr_poll() and netisr_pollmore(), as well as a function, netisr_sched_poll(), to allow the polling code to schedule netisr execution. DEVICE_POLLING still embeds single-netisr assumptions in its implementation, so for now if it is compiled into the kernel, a single and un-bound netisr thread is enforced regardless of tunable configuration. In the default configuration, the new netisr implementation maintains the same basic assumptions as the previous implementation: a single, un-bound worker thread processes all deferred work, and direct dispatch is enabled by default wherever possible. Performance measurement shows a marginal performance improvement over the old implementation due to the use of batched dequeue. An rmlock is used to synchronize use and registration/unregistration using the framework; currently, synchronized use is disabled (replicating current netisr policy) due to a measurable 3%-6% hit in ping-pong micro-benchmarking. It will be enabled once further rmlock optimization has taken place. However, in practice, netisrs are rarely registered or unregistered at runtime. A new man page for netisr will follow, but since one doesn't currently exist, it hasn't been updated. This change is not appropriate for MFC, although the polling shutdown handler should be merged to 7-STABLE. Bump __FreeBSD_version. Reviewed by: bz
# 192893	27-May-2009	trasz	Don't discard packets with 'Destination Unreachable' at the beginning of ip_forward(), if the IPSEC is compiled in. It is possible that there is an SPD that these packets will go through, even if there is no matching route. If not, ICMP will be sent anyway, after ip_output(). This is somewhat similar in purpose to r191621, except that one was for the packets sent from the host, while this one is for packets being forwarded by the host. Reviewed by: bz@ Sponsored by: Wheel Sp. z o.o. (http://www.wheel.pl)
# 191816	05-May-2009	zec	Change the curvnet variable from a global const struct vnet *, previously always pointing to the default vnet context, to a dynamically changing thread-local one. The curvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_* macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, socket's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, with an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typical cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)
# 191688	30-Apr-2009	zec	Permit building kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_* macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_* family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)
# 191314	20-Apr-2009	rwatson	In ip_input(), cache the received mbuf's network interface in a local variable. Acquire the interface address list lock when iterating over the interface address list searching for a matching received broadcast address. MFC after: 2 weeks
# 191259	19-Apr-2009	kmacy	- Allocate a small flowtable in ip_input.c (changeable by tunable) - Use for accelerating ip_output
# 190951	11-Apr-2009	rwatson	Update stats in struct ipstat using four new macros, IPSTAT_ADD(), IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days
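The accessor-macro approach in r190951 can be sketched in userland as below. The four macro names mirror those listed in the commit message but carry a `SKETCH_` prefix because the bodies here are assumptions, not the kernel definitions; the structure, its two fields, and the helper functions are hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical statistics structure with two example counters. */
struct sketch_ipstat {
	uint64_t ips_total;	/* packets seen */
	uint64_t ips_fragments;	/* fragments currently accounted for */
};

static struct sketch_ipstat sketch_ipstat;

/*
 * All updates go through these macros, so the storage behind them
 * (for example, a per-CPU copy) could change without touching callers.
 */
#define	SKETCH_IPSTAT_ADD(name, val)	(sketch_ipstat.name += (val))
#define	SKETCH_IPSTAT_SUB(name, val)	(sketch_ipstat.name -= (val))
#define	SKETCH_IPSTAT_INC(name)		SKETCH_IPSTAT_ADD(name, 1)
#define	SKETCH_IPSTAT_DEC(name)		SKETCH_IPSTAT_SUB(name, 1)

/* Account for one received packet. */
static void
sketch_count_packet(int is_fragment)
{
	SKETCH_IPSTAT_INC(ips_total);
	if (is_fragment)
		SKETCH_IPSTAT_INC(ips_fragments);
}

/* Release one fragment from the accounting; guards against underflow. */
static void
sketch_uncount_fragment(void)
{
	assert(sketch_ipstat.ips_fragments > 0);
	SKETCH_IPSTAT_DEC(ips_fragments);
}
```

Callers never name the backing variable directly, which is what makes a later change of implementation a local edit to the macros.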
# 190909	11-Apr-2009	zec	Introduce vnet module registration / initialization framework with dependency tracking and ordering enforcement. With this change, per-vnet initialization functions introduced with r190787 are no longer directly called from traditional initialization functions (which cc in most cases inlined to pre-r190787 code), but are instead registered via the vnet framework first, and are invoked only after all prerequisite modules have been initialized. In the long run, this framework should allow us to both initialize and dismantle multiple vnet instances in a correct order. The problem this change aims to solve is how to replay the initialization sequence of various network stack components, which have been traditionally triggered via different mechanisms (SYSINIT, protosw). Note that this initialization sequence was and still can be subtly different depending on whether certain pieces of code have been statically compiled into the kernel, loaded as modules by boot loader, or kldloaded at run time. The approach is simple - we record the initialization sequence established by the traditional mechanisms whenever vnet_mod_register() is called for a particular vnet module. The vnet_mod_register_multi() variant allows a single initializer function to be registered multiple times but with different arguments - currently this is only used in kern/uipc_domain.c by net_add_domain() with different struct domain * as arguments, which allows for protosw-registered initialization routines to be invoked in a correct order by the new vnet initialization framework. For the purpose of identifying vnet modules, each vnet module has to have a unique ID, which is statically assigned in sys/vimage.h. Dynamic assignment of vnet module IDs is not supported yet. A vnet module may specify a single prerequisite module at registration time by filling in the vmi_dependson field of its vnet_modinfo struct with the ID of the module it depends on. 
Unless specified otherwise, all vnet modules depend on VNET_MOD_NET (container for ifnet list head, rt_tables etc.), which thus has to and will always be initialized first. The framework will panic if it detects any unresolved dependencies before completing system initialization. Detection of unresolved dependencies for vnet modules registered after boot (kldloaded modules) is not provided. Note that the fact that each module can specify only a single prerequisite may become problematic in the long run. In particular, INET6 depends on INET being already instantiated, due to TCP / UDP structures residing in INET container. IPSEC also depends on INET, which will in turn additionally complicate making INET6-only kernel configs a reality. The entire registration framework can be compiled out by turning on the VIMAGE_GLOBALS kernel config option. Reviewed by: bz Approved by: julian (mentor)
# 190787	06-Apr-2009	zec	First pass at separating per-vnet initializer functions from existing functions for initializing global state. At this stage, the new per-vnet initializer functions are directly called from the existing global initialization code, which should in most cases result in compiler inlining those new functions, hence yielding a near-zero functional change. Modify the existing initializer functions which are invoked via protosw, like ip_init() et. al., to allow them to be invoked multiple times, i.e. per each vnet. Global state, if any, is initialized only if such functions are called within the context of vnet0, which will be determined via the IS_DEFAULT_VNET(curvnet) check (currently always true). While here, V_irtualize a few remaining global UMA zones used by net/netinet/netipsec networking code. While it is not yet clear to me or anybody else whether this is the right thing to do, at this stage this makes the code more readable, and makes it easier to track uncollected UMA-zone-backed objects on vnet removal. In the long run, it's quite possible that some form of shared use of UMA zone pools among multiple vnets should be considered. Bump __FreeBSD_version due to changes in layout of structs vnet_ipfw, vnet_inet and vnet_net. Approved by: julian (mentor)
# 189592	09-Mar-2009	bms	Merge IGMPv3 and Source-Specific Multicast (SSM) to the FreeBSD IPv4 stack. Diffs are minimized against p4. PCS has been used for some protocol verification, more widespread testing of recorded sources in Group-and-Source queries is needed. sizeof(struct igmpstat) has changed. __FreeBSD_version is bumped to 800070.
# 189106	27-Feb-2009	bz	For all files including net/vnet.h directly include opt_route.h and net/route.h. Remove the hidden include of opt_route.h and net/route.h from net/vnet.h. We need to make sure that both opt_route.h and net/route.h are included before net/vnet.h because of the way MRT figures out the number of FIBs from the kernel option. If we do not, we end up with the default number of 1 when including net/vnet.h and array sizes are wrong. This does not change the list of files which depend on opt_route.h but we can identify them now more easily.
# 186119	15-Dec-2008	qingli	The main goals of this project are: 1. separating L2 tables (ARP, NDP) from the L3 routing tables 2. removing as many locking dependencies among these layers as possible to allow for some parallelism in the search operations 3. simplifying the logic in the routing code. The most notable end result is the obsolescence of the route cloning (RTF_CLONING) concept, which translated into code reduction in both IPv4 ARP and IPv6 NDP related modules, and size reduction in struct rtentry{}. The change in design obsoletes the semantics of RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland applications such as "arp" and "ndp" have been modified to reflect those changes. The output from "netstat -r" shows only the routing entries. Quite a few developers have contributed to this project in the past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and Andre Oppermann. And most recently: - Kip Macy revised the locking code completely, thus completing the last piece of the puzzle; Kip has also been conducting active functional testing - Sam Leffler has helped me improve and refactor the code, and provided valuable reviews - Julian Elischer set up the Perforce tree for me and has helped me maintain that branch before the svn conversion
# 185895	10-Dec-2008	zec	Conditionally compile out V_ globals while instantiating the appropriate container structures, depending on VIMAGE_GLOBALS compile time option. Make VIMAGE_GLOBALS a new compile-time option, which by default will not be defined, resulting in instantiations of global variables selected for V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be effectively compiled out. Instantiate new global container structures to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0, vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0. Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_ macros resolve either to the original globals, or to fields inside container structures, i.e. effectively #ifdef VIMAGE_GLOBALS #define V_rt_tables rt_tables #else #define V_rt_tables vnet_net_0._rt_tables #endif Update SYSCTL_V_*() macros to operate either on globals or on fields inside container structs. Extend the internal kldsym() lookups with the ability to resolve selected fields inside the virtualization container structs. This applies only to the fields which are explicitly registered for kldsym() visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently this is done only in sys/net/if.c. Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code, and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in turn result in proper code being generated depending on VIMAGE_GLOBALS. De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c which were prematurely V_irtualized by automated V_ prepending scripts during earlier merging steps. PF virtualization will be done separately, most probably after next PF import. Convert a few variable initializations at instantiation to initialization in init functions, most notably in ipfw. Also convert TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in initializer functions.
Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 185571	02-Dec-2008	bz	Rather than using hidden includes (with circular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
# 185419	28-Nov-2008	zec	Unhide declarations of network stack virtualization structs from underneath #ifdef VIMAGE blocks. This change introduces some churn in #include ordering and nesting throughout the network stack and drivers but is not expected to cause any additional issues. In the next step this will allow us to instantiate the virtualization container structures and switch from using global variables to their "containerized" counterparts. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 185088	19-Nov-2008	zec	Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instantiation, assign initial values to them in initializer functions. As a rule, initialization at instantiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentially, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methodology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to initialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 183550	02-Oct-2008	zec	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparison between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 183388	26-Sep-2008	emaste	Move CTASSERT from header file to source file, per implementation note now in the CTASSERT man page. Submitted by: Ryan Stone
# 182146	25-Aug-2008	julian	Another V_ forgotten
# 181888	19-Aug-2008	julian	Fix some of the formatting fixes. It's amazing how some things stand out in a commit message.
# 181887	19-Aug-2008	julian	A bunch of formatting fixes brought to light by, or created by the Vimage commit a few days ago.
# 181803	17-Aug-2008	bz	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
# 180239	03-Jul-2008	rwatson	Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific netisr handlers to always be dispatched via a queue (deferred). Mark the usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly acquire Giant in those handlers. Previously, any netisr handler not marked NETISR_MPSAFE would necessarily run deferred and with Giant acquired. This change removes Giant scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows non-MPSAFE handlers to continue to force deferred dispatch so as to avoid lock order reversals between their acquisition of Giant and any calling context. It is likely we will be able to remove NETISR_FORCEQUEUE once IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no longer be supported. Reviewed by: bz MFC after: 1 month X-MFC note: We can't remove NETISR_MPSAFE from stable/7 for KPI reasons, but the rest can go back.
# 180215	03-Jul-2008	bz	Remove a bogusly introduced rtalloc_ign() in rev. 1.335/SVN 178029, generating an RTM_MISS for every IP packet forwarded making user space routing daemons unhappy. PR: kern/123621, kern/124540, kern/122338 Reported by: Paul <paul gtcomm.net>, Mike Tancsa <mike sentex.net> on net@ Tested by: Paul and Mike Reviewed by: andre MFC after: 3 days
# 178888	09-May-2008	julian	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). 
The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extend that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X], where for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available).
These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. If nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being responded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect with the process that set up the tunnel. Thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. Messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, it can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly. ipfw has grown 2 new keywords: setfib N ip from any to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity.
I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. It will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. This will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. Instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibility of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the destination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
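The two-dimensional table described in r178888 can be sketched roughly as follows. This is a minimal illustration only, not the kernel's code: every `sketch_` name, the table sizes, and the choice to return NULL for an out-of-range index are assumptions made for the example. It shows a legacy-style lookup that only ever sees row 0, and a FIB-aware lookup in which only IPv4 honors the requested row.

```c
#include <stddef.h>

/* Hypothetical sizes and constants, chosen only for this sketch. */
#define SKETCH_MAXFIBS 16   /* the log mentions a 16-table limit in the first commit */
#define SKETCH_AF_MAX   4   /* number of protocol families in this toy table */
#define SKETCH_AF_INET  2   /* stand-in for the IPv4 family index */

/* Stand-in for a per-family routing table head. */
struct sketch_rnh {
    int id;
};

/* rt_tables[Y][X]: row Y is the FIB number, column X is the protocol family. */
static struct sketch_rnh *sketch_rt_tables[SKETCH_MAXFIBS][SKETCH_AF_MAX];

/* Legacy-style entry point: unaware of FIBs, so it always uses row 0. */
static struct sketch_rnh *
sketch_lookup(int af)
{
    if (af < 0 || af >= SKETCH_AF_MAX)
        return NULL;
    return sketch_rt_tables[0][af];
}

/* FIB-aware entry point: only IPv4 uses the requested row; every other
 * family is forced to row 0, as the commit message describes. */
static struct sketch_rnh *
sketch_lookup_fib(int af, unsigned int fibnum)
{
    if (af < 0 || af >= SKETCH_AF_MAX || fibnum >= SKETCH_MAXFIBS)
        return NULL;    /* assumption: reject out-of-range indexes */
    if (af != SKETCH_AF_INET)
        fibnum = 0;
    return sketch_rt_tables[fibnum][af];
}
```

Because row 0 has the same layout as the old one-dimensional array, callers of the legacy-style function keep working unchanged, which is the ABI-compatibility point the commit message makes.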
# 178029	09-Apr-2008	bz	Take the route mtu into account, if available, when sending an ICMP unreach, frag needed. Up to now we only looked at the interface MTU. Make sure to only use the minimum of the two. In case IPSEC is compiled in, loop the mtu through ip_ipsec_mtu() to avoid any further conditional maths. Without this, PMTU was broken in those cases when there was a route with a lower MTU than the MTU of the outgoing interface. PR: kern/122338 Tested by: Mark Cammidge mark peralex.com Reviewed by: silence on net@ MFC after: 2 weeks
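The rule in r178029 (use the smaller of the route MTU and the interface MTU, when a route MTU is available) can be sketched as below. The function name and the convention that a route MTU of 0 means "not available" are assumptions for this example; the real code also passes the value through ip_ipsec_mtu() when IPSEC is compiled in, which is omitted here.

```c
/* Hypothetical helper: pick the MTU to report in an ICMP
 * "fragmentation needed" message. rt_mtu == 0 is taken to mean
 * that the route carries no MTU (an assumption of this sketch). */
static unsigned long
sketch_unreach_mtu(unsigned long if_mtu, unsigned long rt_mtu)
{
    if (rt_mtu == 0)
        return if_mtu;                          /* only the interface MTU is known */
    return (rt_mtu < if_mtu) ? rt_mtu : if_mtu; /* minimum of the two */
}
```

With an interface MTU of 1500 and a route MTU of 1400, this reports 1400, which is the case the commit message says was previously broken for PMTU discovery.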
# 174171	02-Dec-2007	guido	Consider the following situation: 1. A packet comes in that is to be forwarded 2. The destination of the packet is rewritten by some firewall code 3. The next link's MTU is too small 4. The packet has the DF bit set Then the current code is such that instead of setting the next link's MTU in the ICMP error, ip_next_mtu() is called and a guess is sent as to which MTU is supposed to be tried next. This is because in this case ip_forward() is called with srcrt set to 1. In that case the ia pointer remains NULL but it is needed to get the MTU of the interface the packet is to be sent out from. Thus, we always set ia to the outgoing interface. MFC after: 2 weeks
# 172930	24-Oct-2007	rwatson	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 172467	07-Oct-2007	silby	Add FBSDID to all files in netinet so that people can more easily include file version information in bug reports. Approved by: re (kensmith)
# 171732	05-Aug-2007	bz	Rename option IPSEC_FILTERGIF to IPSEC_FILTERTUNNEL. Also rename the related functions in a similar way. There are no functional changes. For a packet coming in with IPsec tunnel mode, the default is to only call into the firewall with the "outer" IP header and payload. With this option turned on, in addition to the "outer" parts, the "inner" IP header and payload are passed to the firewall too when going through ip_input() the second time. The option was never only related to a gif(4) tunnel within an IPsec tunnel and thus the name was very misleading. Discussed at: BSDCan 2007 Best new name suggested by: rwatson Reviewed by: rwatson Approved by: re (bmah)
# 171167	03-Jul-2007	gnn	Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC option is now deprecated, as well as the KAME IPsec code. What was FAST_IPSEC is now IPSEC. Approved by: re Sponsored by: Secure Computing
# 171133	01-Jul-2007	gnn	Commit IPv6 support for FAST_IPSEC to the tree. This commit includes only the kernel files, the rest of the files will follow in a second commit. Reviewed by: bz Approved by: re Supported by: Secure Computing
# 169625	16-May-2007	rwatson	Remove leading spaces before tabs spotted thanks to silby using kwrite to read ip_input.c.
# 169454	10-May-2007	rwatson	Move universally to ANSI C function declarations, with relatively consistent style(9)-ish layout.
# 167886	25-Mar-2007	rwatson	Replace a comment about RSVP/mrouting with a different but similar comment explaining that some more locking is needed. The routing pieces are done, but there is an interlocking issue between optionally compiled code and mandatory code. Spotted by: kris
# 167721	19-Mar-2007	andre	Match up SYSCTL declaration style.
# 166450	03-Feb-2007	bms	In regular forwarding path, reject packets destined for 169.254.0.0/16 link-local addresses. See RFC 3927 section 2.7.
# 163606	22-Oct-2006	rwatson	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 163548	20-Oct-2006	julian	revert last change.. premature.. need to wait until if_ethersubr.c uses pfil to get to ipfw.
# 163545	20-Oct-2006	julian	Move some variables to a more likely place and remove "temporary" stuff that is not needed any more.
# 161380	16-Aug-2006	julian	Remove the IPFIREWALL_FORWARD_EXTENDED option and make it on by default as it always was in older versions of FreeBSD. This option is pointless as it is needed in just about every interesting usage of forward that I have ever seen. It doesn't make the system any safer and just wastes huge amounts of developer time when the system doesn't behave as expected when code is moved from 4.x to 6.x or 7.x. Reviewed by: glebius MFC after: 1 week
# 158470	12-May-2006	mlaier	Reintroduce net.inet6.ip6.fw.enable sysctl to dis/enable the ipv6 processing separately. Also use pfil hook/unhook instead of keeping the check functions in pfil just to return there based on the sysctl. While here fix some whitespace on a nearby SYSCTL_ macro.
# 158303	05-May-2006	pjd	Force commit to provide correct commit message: Set 'fp' variable to NULL after freeing it, so it won't be dereferenced later. Found by: Coverity Prevent analysis tool CID: 993 MFC after: 2 weeks
# 158302	05-May-2006	pjd	/tmp/cvsTXPIwQ
# 157927	21-Apr-2006	ps	Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.
# 155425	07-Feb-2006	oleg	Fix a five-year-old bug in ip_reass(): if we are using 'full' (i.e. including pseudo header) hardware rx checksum offloading, ip_reass() fails to calculate the TCP/UDP checksum for the reassembled packet correctly. This also should fix the recent 'NFS over UDP over bge' issue exposed by if_bge.c rev. 1.123 Reviewed by: sam (earlier version), bde Approved by: glebius (mentor) MFC after: 2 weeks
# 155201	02-Feb-2006	csjp	Somewhat re-factor the read/write locking mechanism associated with the packet filtering mechanisms to use the new rwlock(9) locking API: - Drop the variables stored in the pfil_head structure which were specific to conditions and the home rolled read/write locking mechanism. - Drop some includes which were used for condition variables - Drop the inline functions, and convert them to macros. Also, move these macros into pfil.h - Move pfil list locking macros into pfil.h as well - Rename ph_busy_count to ph_nhooks. This variable will represent the number of IN/OUT hooks registered with the pfil head structure - Define PFIL_HOOKED macro which evaluates to true if there are any hooks to be run by pfil_run_hooks - In the IP/IP6 stacks, change the ph_busy_count comparison to use the new PFIL_HOOKED macro. - Drop optimization in pfil_run_hooks which checks to see if there are any hooks to be run, and returns if not. This check is already performed by the IP stacks when they call: if (!PFIL_HOOKED(ph)) goto skip_hooks; - Drop in assertion which makes sure that the number of hooks never drops below 0 for good measure. This in theory should never happen, and if it does then there are problems somewhere - Drop special logic around PFIL_WAITOK because rw_wlock(9) does not sleep - Drop variables which support home rolled read/write locking mechanism from the IPFW firewall chain structure. - Swap out the read/write firewall chain lock internal to use the rwlock(9) API instead of our home rolled version - Convert the inlined functions to macros Reviewed by: mlaier, andre, glebius Thanks to: jhb for the new locking API
# 155179	01-Feb-2006	andre	Move the IPSEC related code blocks to their own file to unclutter and significantly improve the readability of ip_input() and ip_output() again. The resulting IPSEC hooks in ip_input() and ip_output() may be used later on for making IPSEC loadable. This move is mostly mechanical and should preserve current IPSEC behaviour as-is. Nothing shall prevent improvements in the way IPSEC interacts with the IPv4 stack. Discussed with: bz, gnn, rwatson; (earlier version)
# 154780	24-Jan-2006	andre	When doing IP forwarding with [FAST_]IPSEC compiled into the kernel ip_forward() would report back a zero MTU in ICMP needfrag messages because on an IPSEC SP lookup failure no MTU got computed. Fix this by changing the logic to compute a new MTU in any case if IPSEC didn't do it. Change MTU computation logic to use egress interface MTU if available or the next smaller MTU compared to the current packet size instead of falling back to a very small fixed MTU. Fix associated comment. PR: kern/91412 MFC after: 3 days
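The fallback order described in r154780 (egress interface MTU if known, otherwise the next smaller MTU below the current packet size) might look roughly like this. It is a sketch under stated assumptions, not the code from ip_forward(): the function names are invented, an interface MTU of 0 is taken to mean "unknown", and the plateau list is the one from RFC 1191, used here only as an illustrative table.

```c
#include <stddef.h>

/* Illustrative MTU plateaus from RFC 1191, largest first. */
static const int sketch_plateaus[] = {
    65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68
};

/* Return the largest plateau strictly smaller than pktlen, or 0 if none. */
static int
sketch_next_smaller_mtu(int pktlen)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_plateaus) / sizeof(sketch_plateaus[0]); i++) {
        if (sketch_plateaus[i] < pktlen)
            return sketch_plateaus[i];
    }
    return 0;
}

/* Prefer the egress interface MTU; egress_if_mtu <= 0 means "unknown"
 * in this sketch, in which case fall back to the plateau table. */
static int
sketch_needfrag_mtu(int egress_if_mtu, int pktlen)
{
    if (egress_if_mtu > 0)
        return egress_if_mtu;
    return sketch_next_smaller_mtu(pktlen);
}
```

The point of the fallback is that the reported MTU tracks the packet that failed, rather than dropping straight to a very small fixed value.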
# 154400	15-Jan-2006	rwatson	Modify the IP fragment reassembly code so that it uses a new UMA zone, ipq_zone, to allocate fragment headers from, rather than using cast mbuf storage. This was one of the few remaining uses of mbuf storage for local data structures that relied on dtom(). Implement the resource limit on ipq's using UMA zone limits, but preserve current sysctl semantics using a sysctl proc. MFC after: 3 weeks
# 154395	15-Jan-2006	rwatson	Staticize ipqlock, since it is local to ip_input.c. MFC after: 3 days
# 153072	04-Dec-2005	ru	Fix -Wundef.
# 152612	19-Nov-2005	andre	Remove 'ipprintfs' which were protected under DIAGNOSTIC. It doesn't have any knob to enable it from userland and could only be enabled by either setting it to 1 at compile time or through the kernel debugger. In the future it may be brought back as KTR tracing points. Discussed with: rwatson Sponsored by: TCP/IP Optimization Fundraise 2005
# 152592	18-Nov-2005	andre	Consolidate all IP Options handling functions into ip_options.[ch] and include ip_options.h into all files making use of IP Options functions. From ip_input.c rev 1.306: ip_dooptions(struct mbuf *m, int pass) save_rte(m, option, dst) ip_srcroute(m0) ip_stripoptions(m, mopt) From ip_output.c rev 1.249: ip_insertoptions(m, opt, phlen) ip_optcopy(ip, jp) ip_pcbopts(struct inpcb *inp, int optname, struct mbuf *m) No functional changes in this commit. Discussed with: rwatson Sponsored by: TCP/IP Optimization Fundraise 2005
# 152581	18-Nov-2005	andre	In ip_forward() copy as much into the temporary error mbuf as we have free space in it. Allocate correct mbuf from the beginning. This allows icmp_error() to quote the entire TCP header in error messages. Sponsored by: TCP/IP Optimization Fundraise 2005
# 152315	11-Nov-2005	ru	- Store pointer to the link-level address right in "struct ifnet" rather than in ifindex_table[]; all (except one) accesses are through ifp anyway. IF_LLADDR() works faster, and all (except one) ifaddr_byindex() users were converted to use ifp->if_addr. - Stop storing a (pointer to) Ethernet address in "struct arpcom", and drop the IFP2ENADDR() macro; all users have been converted to use IF_LLADDR() instead.
# 149635	30-Aug-2005	andre	Use the correct mbuf type for MGET().
# 148682	03-Aug-2005	rwatson	Introduce in_multi_mtx, which will protect IPv4-layer multicast address lists, as well as accessor macros. For now, this is a recursive mutex due to code sequences where IPv4 multicast calls into IGMP calls into ip_output(), which then tests for a multicast forwarding case. For support macros in in_var.h to check multicast address lists, assert that in_multi_mtx is held. Acquire in_multi_mtx around iteration over the IPv4 multicast address lists, such as in ip_input() and ip_output(). Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses, as well as over the manipulation of ifnet multicast address lists in order to keep the two layers in sync. Lock down accesses to IPv4 multicast addresses in IGMP, or assert the lock when performing IGMP join/leave events. Eliminate spl's associated with IPv4 multicast addresses, portions of IGMP that weren't previously expunged by IGMP locking. Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded lock order in WITNESS, in that order. Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca> MFC after: 10 days
# 148155	19-Jul-2005	rwatson	Remove spl() calls from ip_slowtimo(), as IP fragment queue locking was merged several years ago. Submitted by: gnn MFC after: 1 day
# 145863	04-May-2005	andre	Pass icmp_error() the MTU argument directly instead of an interface pointer. This simplifies a couple of uses and removes some XXX workarounds.
# 144792	08-Apr-2005	maxim	o Nano optimize ip_reass() code path for the first fragment: do not try to reassemble the packet from a fragment queue that holds only one fragment; finish with the first fragment as soon as we create a queue. Spotted by: Vijay Singh o Drop the fragment if maxfragsperpacket == 0, as there is no chance we will be able to reassemble the packet in the future. Reviewed by: silby
# 143676	16-Mar-2005	sam	plug resource leak Noticed by: Coverity Prevent analysis tool
# 142268	22-Feb-2005	sam	fix potential invalid index into ip_protox array Noticed by: Coverity Prevent analysis tool
# 142248	22-Feb-2005	andre	Bring back the full packet destination manipulation for 'ipfw fwd' with the kernel compile time option: options IPFIREWALL_FORWARD_EXTENDED This option has to be specified in addition to IPFIREWALL_FORWARD. With this option even packets targeted for an IP address local to the host can be redirected. All restrictions to ensure proper behaviour for locally generated packets are turned off. Firewall rules have to be carefully crafted to make sure that things like PMTU discovery do not break. Document the two kernel options. PR: kern/71910 PR: kern/73129 MFC after: 1 week
# 142215	22-Feb-2005	glebius	Add CARP (Common Address Redundancy Protocol), which allows multiple hosts to share an IP address, providing high availability and load balancing. Original work on CARP done by Michael Shalayeff, with many additions by Marco Pfatschbacher and Ryan McBride. FreeBSD port done solely by Max Laier. Patch by: mlaier Obtained from: OpenBSD (mickey, mcbride)
# 141064	30-Jan-2005	rwatson	Prefer (NULL) spelling of (0) for pointers. MFC after: 3 days
# 139823	06-Jan-2005	imp	/* -> /*- for license, minor formatting changes
# 139558	01-Jan-2005	silby	Port randomization leads to extremely fast port reuse at high connection rates, which is causing problems for some users. To retain the security advantage of random ports and ensure correct operation for high connection rate users, disable port randomization during periods of high connection rates. Whenever the connection rate exceeds randomcps (10 by default), randomization will be disabled for randomtime (45 by default) seconds. These thresholds may be tuned via sysctl. Many thanks to Igor Sysoev, who proved the necessity of this change and tested many preliminary versions of the patch. MFC After: 20 seconds
# 136694	19-Oct-2004	andre	Support for dynamically loadable and unloadable IP protocols in the ipmux. With pr_proto_register() it has become possible to dynamically load protocols within the PF_INET domain. However the PF_INET domain has a second important structure called ip_protox[] that is derived from the 'struct protosw inetsw[]' and takes care of the de-multiplexing of the various protocols that ride on top of IP packets. The functions ipproto_[un]register() allow to dynamically adjust the ip_protox[] array mux in a consistent and easy way. To register a protocol within ip_protox[] the existence of a corresponding and matching protocol definition in inetsw[] is required. The function does not allow to overwrite an already registered protocol. The unregister function simply replaces the mux slot with the default index pointer to IPPROTO_RAW as it was previously.
# 135920	29-Sep-2004	mlaier	Add an additional struct inpcb * argument to pfil(9) in order to enable passing along socket information. This is required to work around a LOR with the socket code which results in an easily reproducible hard lockup with debug.mpsafenet=1. This commit does not fix the LOR, but enables us to do so later. The missing piece is to turn the filter locking into a leaf lock and will follow in a separate (later) commit. This will hopefully be MT5'ed in order to fix the problem for RELENG_5 in the foreseeable future. Suggested by: rwatson A lot of work by: csjp (he'd be even more helpful w/o mentor-reviews ;) Reviewed by: rwatson, csjp Tested by: -pf, -ipfw, LINT, csjp and myself MFC after: 3 days LOR IDs: 14 - 17 (not fixed yet)
# 135731	24-Sep-2004	maxim	o Turn net.inet.ip.check_interface sysctl off by default. When net.inet.ip.check_interface was MFCed to RELENG_4 3+ years ago in rev. 1.130.2.17 ip_input.c it was 1 by default but shortly changed to 0 (accidentally?) in rev. 1.130.2.20 in RELENG_4 only. Along with the fact that this knob is not documented, it breaks POLA, especially in bridge environments. OK'ed by: andre Reviewed by: -current
# 135318	16-Sep-2004	andre	Fix an out of bounds write during the initialization of the PF_INET protocol family to the ip_protox[] array. The protocol number of IPPROTO_DIVERT is larger than IPPROTO_MAX and was initializing memory beyond the array. Catch all these kinds of errors by ignoring protocols that are higher than IPPROTO_MAX or 0 (zero). Add more comments to ip_init().
# 135275	15-Sep-2004	andre	Clarify some comments for the M_FASTFWD_OURS case in ip_input().
# 135274	15-Sep-2004	andre	Remove the last two global variables that are used to store packet state while it travels through the IP stack. This wasn't much of a problem because IP source routing is disabled by default but when enabled together with SMP and preemption it would have very likely cross-corrupted the IP options in transit. The IP source route options of a packet are now stored in a mtag instead of the global variable.
# 134383	27-Aug-2004	andre	Always compile PFIL_HOOKS into the kernel and remove the associated kernel compile option. All FreeBSD packet filters now use the PFIL_HOOKS API and thus it becomes a standard part of the network stack. If no hooks are connected the entire packet filter hooks section and related activities are jumped over. This removes any performance impact if no hooks are active. Both OpenBSD and DragonFlyBSD have integrated PFIL_HOOKS permanently as well.
# 134022	19-Aug-2004	andre	Bring back the sysctl 'net.inet.ip.fw.enable' to unbreak the startup scripts and to be able to disable ipfw if it was compiled directly into the kernel.
# 133923	18-Aug-2004	rwatson	Fix build of ip_input.c with "options IPSEC" -- the "pass:" label is used with both FAST_IPSEC and IPSEC, but was defined for only FAST_IPSEC.
# 133920	17-Aug-2004	andre	Convert ipfw to use PFIL_HOOKS. This change is transparent to userland and preserves the ipfw ABI. The ipfw core packet inspection and filtering functions have not been changed, only how ipfw is invoked is different. However there are many changes to how ipfw and its add-ons are handled: In general ipfw is now called through the PFIL_HOOKS and most associated magic, that was in ip_input() or ip_output() previously, is now done in ipfw_check_[in\|out]() in the ipfw PFIL handler. IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to be diverted is checked if it is fragmented, if yes, ip_reass() gets in for reassembly. If not, or all fragments arrived and the packet is complete, divert_packet is called directly. For 'tee' no reassembly attempt is made and a copy of the packet is sent to the divert socket unmodified. The original packet continues its way through ip_input/output(). ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet with the new destination sockaddr_in. A check if the new destination is a local IP address is made and the m_flags are set appropriately. ip_input() and ip_output() have some more work to do here. For ip_input() the m_flags are checked and a packet for us is directly sent to the 'ours' section for further processing. Destination changes on the input path are only tagged and the 'srcrt' flag to ip_forward() is set to disable destination checks and ICMP replies at this stage. The tag is going to be handled on output. ip_output() again checks for m_flags and the 'ours' tag. If found, the packet will be dropped back to the IP netisr where it is going to be picked up by ip_input() again and then directly sent to the 'ours' section. When only the destination changes, the route's 'dst' is overwritten with the new destination from the forward m_tag. Then it jumps back at the route lookup again and skips the firewall check because it has been marked with M_SKIP_FIREWALL.
ipfw 'forward' has to be compiled into the kernel with 'option IPFIREWALL_FORWARD' to enable it. DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will then inject it back into ip_input/ip_output() after it has served its time. Dummynet packets are tagged and will continue from the next rule when they hit the ipfw PFIL handlers again after re-injection. BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS. More detailed changes to the code: conf/files Add netinet/ip_fw_pfil.c. conf/options Add IPFIREWALL_FORWARD option. modules/ipfw/Makefile Add ip_fw_pfil.c. net/bridge.c Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw is still directly invoked to handle layer2 headers and packets would get a double ipfw when run through PFIL_HOOKS as well. netinet/ip_divert.c Removed divert_clone() function. It is no longer used. netinet/ip_dummynet.[ch] Neither the route 'ro' nor the destination 'dst' need to be stored while in dummynet transit. Structure members and associated macros are removed. netinet/ip_fastfwd.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_fw.h Removed 'ro' and 'dst' from struct ip_fw_args. netinet/ip_fw2.c (Re)moved some global variables and the module handling. netinet/ip_fw_pfil.c New file containing the ipfw PFIL handlers and module initialization. netinet/ip_input.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. ip_forward() no longer requires the 'next_hop' struct sockaddr_in argument. Disable early checks if 'srcrt' is set. netinet/ip_output.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. netinet/ip_var.h Add ip_reass() as general function. (Used from ipfw PFIL handlers for IPDIVERT.)
netinet/raw_ip.c Directly check if ipfw and dummynet control pointers are active. netinet/tcp_input.c Rework the 'ipfw forward' to local code to work with the new way of forward tags. netinet/tcp_sack.c Remove include 'opt_ipfw.h' which is not needed here. sys/mbuf.h Remove m_claim_next() macro which was exclusively for ipfw 'forward' and is no longer needed. Approved by: re (scottl)
# 133720	14-Aug-2004	dwmalone	Get rid of the RANDOM_IP_ID option and make it a sysctl. NetBSD have already done this, so I have styled the patch on their work: 1) introduce a ip_newid() static inline function that checks the sysctl and then decides if it should return a sequential or random IP ID. 2) named the sysctl net.inet.ip.random_id 3) IPv6 flow IDs and fragment IDs are now always random. Flow IDs and frag IDs are significantly less common in the IPv6 world (ie. rarely generated per-packet), so there should be smaller performance concerns. The sysctl defaults to 0 (sequential IP IDs). Reviewed by: andre, silby, mlaier, ume Based on: NetBSD MFC after: 2 months
# 133557	12-Aug-2004	andre	Fix two cases of incorrect IPQ_UNLOCK'ing in the merged ip_reass() function. The first one was going to 'dropfrag', which unlocks the IPQ, before the lock was acquired; the second one was doing an unlock and then a 'goto dropfrag', which led to a double-unlock. Tripped over by: des
# 133481	11-Aug-2004	andre	Consistently use NULL for pointer comparisons.
# 133390	09-Aug-2004	andre	Make a comment that IP source routing is not SMP and PREEMPTION safe.
# 133069	03-Aug-2004	andre	o Move all parts of the IP reassembly process into the function ip_reass() to make it fully self-contained. o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len including the IP header. o Computation of the delayed checksum is moved into divert_packet(). Reviewed by: silby
# 131840	08-Jul-2004	brian	Change the following environment variables to kernel options: bootp -> BOOTP bootp.nfsroot -> BOOTP_NFSROOT bootp.nfsv3 -> BOOTP_NFSV3 bootp.compat -> BOOTP_COMPAT bootp.wired_to -> BOOTP_WIRED_TO - i.e. back out the previous commit. It's already possible to pxeboot(8) with a GENERIC kernel. Pointed out by: dwmalone
# 131814	08-Jul-2004	brian	Change the following kernel options to environment variables: BOOTP -> bootp BOOTP_NFSROOT -> bootp.nfsroot BOOTP_NFSV3 -> bootp.nfsv3 BOOTP_COMPAT -> bootp.compat BOOTP_WIRED_TO -> bootp.wired_to This lets you PXE boot with a GENERIC kernel by putting this sort of thing in loader.conf: bootp="YES" bootp.nfsroot="YES" bootp.nfsv3="YES" bootp.wired_to="bge1" or even setting the variables manually from the OK prompt.
# 130685	18-Jun-2004	bms	Check that m->m_pkthdr.rcvif is not NULL before checking if a packet was received on a broadcast address on the input path. Under certain circumstances this could result in a panic, notably for locally-generated packets which do not have m_pkthdr.rcvif set. This is a similar situation to that which is solved by src/sys/netinet/ip_icmp.c rev 1.66. PR: kern/52935
# 130581	16-Jun-2004	bms	In ip_forward(), when calculating the MTU in effect for an IPSEC transport mode tunnel, take the per-route MTU into account, if and only if it is non-zero (as found in struct rt_metrics/rt_metrics_lite). PR: kern/42727 Obtained from: NetBSD (ip_input.c rev 1.151)
# 130580	16-Jun-2004	bms	In ip_forward(), set m->m_pkthdr.len correctly such that the mbuf chain is sane, and ipsec4_getpolicybyaddr() will therefore complete. PR: kern/42727 Obtained from: KAME (kame/freebsd4/sys/netinet/ip_input.c rev 1.42)
# 130416	13-Jun-2004	mlaier	Link ALTQ to the build and break with ABI for struct ifnet. Please recompile your (network) modules as well as any userland that might make sense of sizeof(struct ifnet). This does not change the queueing yet. These changes will follow in a separate commit. Same with the driver changes, which need case by case evaluation. __FreeBSD_version bump will follow. Tested-by: (i386)LINT
# 129017	06-May-2004	andre	Provide the sysctl net.inet.ip.process_options to control the processing of IP options. net.inet.ip.process_options=0 Ignore IP options and pass packets unmodified. net.inet.ip.process_options=1 Process all IP options (default). net.inet.ip.process_options=2 Reject all packets with IP options with ICMP filter prohibited message. This sysctl affects packets destined for the local host as well as those only transiting through the host (routing). IP options do not have any legitimate purpose anymore and are only used to circumvent firewalls or to exploit certain behaviours or bugs in TCP/IP stacks. Reviewed by: sam (mentor)
# 128829	02-May-2004	darrenr	Rename m_claim_next_hop() to m_claim_next(), as suggested by Max Laier.
# 128816	02-May-2004	darrenr	Rename ip_claim_next_hop() to m_claim_next_hop(), give it an extra arg (the type of tag to claim) and push it out of ip_var.h into mbuf.h alongside all of the other macros that work on mbufs and tags.
# 128019	07-Apr-2004	imp	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# 127535	28-Mar-2004	rwatson	Invert the logic of NET_LOCK_GIANT(), and remove the one reference to it. Previously, Giant would be grabbed at entry to the IP local delivery code when debug.mpsafenet was set to true, as that implied Giant wouldn't be grabbed in the driver path. Now, we will use this primitive to conditionally grab Giant in the event the entire network stack isn't running MPSAFE (debug.mpsafenet == 0).
# 126467	01-Mar-2004	rwatson	Rename NET_PICKUP_GIANT() to NET_LOCK_GIANT(), and NET_DROP_GIANT() to NET_UNLOCK_GIANT(). While they are used in similar ways, the semantics are quite different -- NET_LOCK_GIANT() and NET_UNLOCK_GIANT() directly wrap mutex lock and unlock operations, whereas drop/pickup special case the handling of Giant recursion. Add a comment saying as much. Add NET_ASSERT_GIANT(), which conditionally asserts Giant based on the value of debug_mpsafenet.
# 126368	28-Feb-2004	rwatson	Remove unneeded {} originally used to hold local variables for dummynet in a code block, as the variable is now gone. Submitted by: sam
# 126239	25-Feb-2004	mlaier	Re-remove MT_TAGs. The problems with dummynet have been fixed now. Tested by: -current, bms(mentor), me Approved by: bms(mentor), sam
# 125952	17-Feb-2004	mlaier	Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is not working properly with the patch in place. Approved by: bms(mentor)
# 125785	13-Feb-2004	mlaier	Do not check receive interface when pfil(9) hook changed address. Approved by: bms(mentor)
# 125784	13-Feb-2004	mlaier	This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing them mostly with packet tags (one case is handled by using an mbuf flag since the linkage between "caller" and "callee" is direct and there's no need to incur the overhead of a packet tag). This is (mostly) work from: sam Silence from: -arch Approved by: bms(mentor), sam, rwatson
# 125264	31-Jan-2004	phk	Introduce the SO_BINTIME option which takes a high-resolution timestamp at packet arrival. For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL since it has higher resolution and lower overhead. Simultaneous use of the two options is possible and they will return consistent timestamps. This introduces an extra test and a function call for SO_TIMEVAL, but I have not been able to measure that.
# 122996	26-Nov-2003	andre	Make sure all uses of stack allocated struct route's are properly zeroed. Doing a bzero on the entire struct route is not more expensive than assigning NULL to ro.ro_rt and bzero of ro.ro_dst. Reviewed by: sam (mentor) Approved by: re (scottl)
# 122922	20-Nov-2003	andre	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# 122921	20-Nov-2003	andre	Remove RTF_PRCLONING from routing table and adjust users of it accordingly. The define is left intact for ABI compatibility with userland. This is a pre-step for the introduction of tcp_hostcache. The network stack remains fully useable with this change. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# 122828	17-Nov-2003	green	Fix a few cases where MT_TAG-type "fake mbufs" are created on the stack, but do not have mh_nextpkt initialized. Sometimes what's there is "1", and the ip_input() code pukes trying to m_free() it, rendering divert sockets and such broken. This really underscores the need to get rid of MT_TAG. Reviewed by: rwatson
# 122723	14-Nov-2003	andre	Make ipstealth global as we need it in ip_fastforward too.
# 122708	14-Nov-2003	andre	Remove the global one-level rtcache variable and associated complex locking and rework ip_rtaddr() to do its own rtlookup. Adapt all its callers to this and make ip_output() callable with NULL rt pointer. Reviewed by: sam (mentor)
# 122702	14-Nov-2003	andre	Introduce ip_fastforward and remove ip_flow. Short description of ip_fastforward: o adds full direct process-to-completion IPv4 forwarding code o handles ip fragmentation incl. hw support (ip_flow did not) o sends icmp needfrag to source if DF is set (ip_flow did not) o supports ipfw and ipfilter (ip_flow did not) o supports divert, ipfw fwd and ipfilter nat (ip_flow did not) o returns anything it can't handle back to normal ip_input Enable with sysctl -w net.inet.ip.fastforwarding=1 Reviewed by: sam (mentor)
# 122334	08-Nov-2003	sam	replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF macros that expand to include assertions when the system is built with INVARIANTS Supported by: FreeBSD Foundation
# 122320	08-Nov-2003	sam	o add a flags parameter to netisr_register that is used to specify whether or not the isr needs to hold Giant when running; Giant-less operation is also controlled by the setting of debug_mpsafenet o mark all netisr's except NETISR_IP as needing Giant o add a GIANT_REQUIRED assertion to the top of netisr's that need Giant o pickup Giant (when debug_mpsafenet is 1) inside ip_input before calling up with a packet o change netisr handling so swi_net runs w/o Giant; instead we grab Giant before invoking handlers based on whether the handler needs Giant o change netisr handling so that netisr's that are marked MPSAFE may have multiple instances active at a time o add netisr statistics for packets dropped because the isr is inactive Supported by: FreeBSD Foundation
# 122179	06-Nov-2003	sam	Fix locking of the ip forwarding cache. We were holding a reference to a routing table entry w/o bumping the reference count or locking against the entry being free'd. This caused major havoc (for some reason it appeared most frequently for folks running natd). Fix is to bump the reference count whenever we copy the route cache contents into a private copy so the entry cannot be reclaimed out from under us. This is a short term fix as the forthcoming routing table changes will eliminate this cache entirely. Supported by: FreeBSD Foundation
# 122062	04-Nov-2003	ume	- cleanup SP refcnt issue. - share policy-on-socket for listening socket. - don't copy policy-on-socket at all. secpolicy no longer contain spidx, which saves a lot of memory. - deep-copy pcb policy if it is an ipsec policy. assign ID field to all SPD entries. make it possible for racoon to grab SPD entry on pcb. - fixed the order of searching SA table for packets. - fixed to get a security association header. a mode is always needed to compare them. - fixed that the incorrect time was set to sadb_comb_{hard\|soft}_usetime. - disallow port spec for tunnel mode policy (as we don't reassemble). - a user can define a policy-id. - clear enc/auth key before freeing. - fixed that the kernel crashed when key_spdacquire() was called because key_spdacquire() had been implemented incompletely. - preparation for 64bit sequence number. - maintain ordered list of SA, based on SA id. - cleanup secasvar management; refcnt is key.c responsibility; alloc/free is keydb.c responsibility. - cleanup, avoid double-loop. - use hash for spi-based lookup. - mark persistent SP "persistent". XXX in theory refcnt should do the right thing, however, we have "spdflush" which would touch all SPs. another solution would be to de-register persistent SPs from sptree. - u_short -> u_int16_t - reduce kernel stack usage by auto variable secasindex. - clarify function name confusion. ipsec_*_policy -> ipsec_*_pcbpolicy. - avoid variable name confusion. (struct inpcbpolicy *)pcb_sp, spp (struct secpolicy **), sp (struct secpolicy *) - count number of ipsec encapsulations on ipsec4_output, so that we can tell ip_output() how to handle the packet further. - When the value of the ul_proto is ICMP or ICMPV6, the port field in "src" of the spidx specifies ICMP type, and the port field in "dst" of the spidx specifies ICMP code. - avoid from applying IPsec transport mode to the packets when the kernel forwards the packets. Tested by: nork Obtained from: KAME
# 121971	03-Nov-2003	rwatson	Remove comment about desire for eventual explicit labeling of ICMP header copy made on input path: this is now handled differently. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 121684	29-Oct-2003	ume	add ECN support in layer-3. - implement the tunnel egress rule in ip_ecn_egress() in ip_ecn.c. make ip{,6}_ecn_egress() return integer to tell the caller that this packet should be dropped. - handle ECN at fragment reassembly in ip_input.c and frag6.c. Obtained from: KAME
# 121141	16-Oct-2003	sam	pfil hooks can modify packet contents so check if the destination address has been changed when PFIL_HOOKS is enabled and, if it has, arrange for the proper action by ip*_forward. Supported by: FreeBSD Foundation Submitted by: Pyun YongHyeon
# 121119	15-Oct-2003	sam	purge extraneous ';'s Supported by: FreeBSD Foundation Noticed by: bde
# 121093	14-Oct-2003	sam	Lock ip forwarding route cache. While we're at it, remove the global variable ipforward_rt by introducing an ip_forward_cacheinval() call to use to invalidate the cache. Supported by: FreeBSD Foundation
# 121091	14-Oct-2003	sam	remove dangling ';'s that were harmless Supported by: FreeBSD Foundation
# 120386	23-Sep-2003	sam	o update PFIL_HOOKS support to current API used by netbsd o revamp IPv4+IPv6+bridge usage to match API changes o remove pfil_head instances from protosw entries (no longer used) o add locking o bump FreeBSD version for 3rd party modules Heavy lifting by: "Max Laier" <max@love2party.net> Supported by: FreeBSD Foundation Obtained from: NetBSD (bits of pfil.h and pfil.c)
# 119753	04-Sep-2003	sam	lock ip fragment queues Submitted by: Robert Watson <rwatson@freebsd.org> Obtained from: BSD/OS
# 117897	22-Jul-2003	sam	add IPSEC_FILTERGIF support for FAST_IPSEC PR: kern/51922 Submitted by: Eric Masson <e-masson@kisoft-services.com> MFC after: 1 week
# 116462	17-Jun-2003	silby	Map icmp time exceeded responses to EHOSTUNREACH rather than 0 (no error); this makes connect act more sensibly in these cases. PR: 50839 Submitted by: Barney Wolff <barney@pit.databus.com> Patch delayed by laziness of: silby MFC after: 1 week
# 115909	06-Jun-2003	rwatson	When setting fragment queue pointers to NULL, or comparing them with NULL, use NULL rather than 0 to improve readability.
# 114788	06-May-2003	rwatson	Trim a call to mac_create_mbuf_from_mbuf() since m_tag meta-data copying for mbuf headers now works properly in m_dup_pkthdr(), so we don't need to do an explicit copy. Approved by: re (jhb) Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 114258	29-Apr-2003	mdodd	IP_RECVTTL socket option. Reviewed by: Stuart Cheshire <cheshire@apple.com>
# 113255	08-Apr-2003	des	Introduce an M_ASSERTPKTHDR() macro which performs the very common task of asserting that an mbuf has a packet header. Use it instead of hand- rolled versions wherever applicable. Submitted by: Hiten Pandya <hiten@unixdaemons.com>
# 112985	02-Apr-2003	mdodd	Back out support for RFC3514. RFC3514 poses an unacceptable risk to compliant systems.
# 112973	02-Apr-2003	mdodd	Sync constant define with NetBSD. Requested by: Tom Spindler <dogcow@babymeat.com>
# 112929	01-Apr-2003	mdodd	Implement support for RFC 3514 (The Security Flag in the IPv4 Header). (See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt) This fulfills the host requirements for userland support by way of the setsockopt() IP_EVIL_INTENT message. There are three sysctl tunables provided to govern system behavior. net.inet.ip.rfc3514: Enables support for rfc3514. As this is an Informational RFC and support is not yet widespread this option is disabled by default. net.inet.ip.hear_no_evil If set the host will discard all received evil packets. net.inet.ip.speak_no_evil If set the host will discard all transmitted evil packets. The IP statistics counter 'ips_evil' (available via 'netstat') provides information on the number of 'evil' packets received. For reference, the '-E' option to 'ping' has been provided to demonstrate and test the implementation.
# 112675	26-Mar-2003	rwatson	Modify the mac_init_ipq() MAC Framework entry point to accept an additional flags argument to indicate blocking disposition, and pass in M_NOWAIT from the IP reassembly code to indicate that blocking is not OK when labeling a new IP fragment reassembly queue. This should eliminate some of the WITNESS warnings that have started popping up since fine-grained IP stack locking started going in; if memory allocation fails, the creation of the fragment queue will be aborted. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 111888	04-Mar-2003	jlemon	Update netisr handling; Each SWI now registers its queue, and all queue drain routines are done by swi_net, which allows for better queue control at some future point. Packets may also be directly dispatched to a netisr instead of queued, this may be of interest at some installations, but currently defaults to off. Reviewed by: hsu, silby, jayanth, sam Sponsored by: DARPA, NAI Labs
# 111541	26-Feb-2003	silby	Fix a condition so that ip reassembly queues are emptied immediately when maxfragpackets is dropped to 0. Noticed by: bmah
# 111479	25-Feb-2003	maxim	style(9): join lines.
# 111478	25-Feb-2003	maxim	Ip reassembly queue structure has ipq_nfrags now. Count a number of dropped ip fragments precisely. Reviewed by: silby
# 111275	22-Feb-2003	sam	Add a new config option IPSEC_FILTERGIF to control whether or not packets coming out of a GIF tunnel are re-processed by ipfw, et. al. By default they are not reprocessed. With the option they are. This reverts 1.214. Prior to that change packets were not re-processed. After they were which caused problems because packets do not have distinguishing characteristics (like a special network if) that allows them to be filtered specially. This is really a stopgap measure designed for immediate MFC so that 4.8 has consistent handling to what was in 4.7. PR: 48159 Reviewed by: Guido van Rooij <guido@gvr.org> MFC after: 1 day
# 111244	22-Feb-2003	silby	Add the ability to limit the number of IP fragments allowed per packet, and enable it by default, with a limit of 16. At the same time, tweak maxfragpackets downward so that in the worst possible case, IP reassembly can use only 1/2 of all mbuf clusters. MFC after: 3 days Reviewed by: hsu Liked by: bmah
# 111119	19-Feb-2003	imp	Back out M_* changes, per decision of the TRB. Approved by: trb
# 110178	01-Feb-2003	silby	Move a comment and optimize the frag timeout code a slight bit. Submitted by: maxim MFC with: The previous two revisions
# 109965	28-Jan-2003	silby	A few fixes to rev 1.221 - Honor the previous behavior of maxfragpackets = 0 or -1 - Take a better stab at fragment statistics - Move / correct a comment Suggested by: maxim@ MFC after: 7 days
# 109843	25-Jan-2003	silby	Merge the best parts of maxfragpackets and maxnipq together. (Both functions implemented approximately the same limits on fragment memory usage, but in different fashions.) End user visible changes: - Fragment reassembly queues are freed in a FIFO manner when maxfragpackets has been reached, rather than all reassembly stopping. MFC after: 5 days
# 109623	21-Jan-2003	alfred	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 108466	30-Dec-2002	sam	Correct mbuf packet header propagation. Previously, packet headers were sometimes propagated using M_COPY_PKTHDR which actually did something between a "move" and a "copy" operation. This is replaced by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it from the source mbuf) and m_dup_pkthdr which copies the packet header contents including any m_tag chain. This corrects numerous problems whereby mbuf tags could be lost during packet manipulations. These changes also introduce arguments to m_tag_copy and m_tag_copy_chain to specify if the tag copy work should potentially block. This introduces an incompatibility with openbsd which we may want to revisit. Note that move/dup of packet headers does not handle target mbufs that have a cluster bound to them. We may want to support this; for now we watch for it with an assert. Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG\|M_LASTFRAG. Supported by: Vernier Networks Reviewed by: Robert Watson <rwatson@FreeBSD.org>
# 107114	20-Nov-2002	luigi	Move fw_one_pass from ip_fw2.c to ip_input.c so that neither bridge.c nor if_ethersubr.c depend on IPFIREWALL. Restore the use of fw_one_pass in if_ethersubr.c ipfw.8 will be updated with a separate commit. Approved by: re
# 107081	19-Nov-2002	silby	Add a sysctl to control the generation of source quench packets, and set it to 0 by default. Partially obtained from: NetBSD Suggested by: David Gilbert MFC after: 5 days
# 106968	15-Nov-2002	luigi	Massive cleanup of the ip_mroute code. No functional changes, but: + the mrouting module now should behave the same as the compiled-in version (it did not before, some of the rsvp code was not loaded properly); + netinet/ip_mroute.c is now truly optional; + removed some redundant/unused code; + changed many instances of '0' to NULL and INADDR_ANY as appropriate; + removed several static variables to make the code more SMP-friendly; + fixed some minor bugs in the mrouting code (mostly, incorrect return values from functions). This commit is also a prerequisite to the addition of support for PIM, which i would like to put in before DP2 (it does not change any of the existing APIs, anyways). Note, in the process we found out that some device drivers fail to properly handle changes in IFF_ALLMULTI, leading to interesting behaviour when a multicast router is started. This bug is not corrected by this commit, and will be fixed with a separate commit. Detailed changes: -------------------- netinet/ip_mroute.c all the above. conf/files make ip_mroute.c optional net/route.c fix mrt_ioctl hook netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here together with other rsvp code, and a couple of indentation fixes. netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks netinet/ip_var.h rsvp function hooks netinet/raw_ip.c hooks for mrouting and rsvp functions, plus interface cleanup. netinet/ip_mroute.h remove an unused and optional field from a struct Most of the code is from Pavlin Radoslavov and the XORP project Reviewed by: sam MFC after: 1 week
# 105586	20-Oct-2002	phk	Fix two instances of variant struct definitions in sys/netinet: Remove the never completed _IP_VHL version, it has not caught on anywhere and it would make us incompatible with other BSD netstacks to retain this version. Add a CTASSERT protecting sizeof(struct ip) == 20. Don't let the size of struct ipq depend on the IPDIVERT option. This is a functional no-op commit. Approved by: re
# 105218	16-Oct-2002	guido	Get rid of checking for IPsec history. It is true that packets are not supposed to be checked by the firewall rules twice. However, because the various ipsec handlers never call ip_input(), this never happens anyway. This fixes the situation where a gif tunnel is encrypted with IPsec. In such a case, after IPsec processing, the unencrypted contents from the GIF tunnel are fed back to the ipintrq and subsequently handled by ip_input(). Yet, since there still is IPsec history attached, the packets coming out from the gif device are never fed into the filtering code. This fix was sent to Itojun, and he pointed towards http://www.netbsd.org/Documentation/network/ipsec/#ipf-interaction. This patch actually implements what is stated there (specifically: Packet came from tunnel devices (gif(4) and ipip(4)) will still go through ipf(4). You may need to identify these packets by using interface name directive in ipf.conf(5)). Reviewed by: rwatson MFC after: 3 weeks
# 105199	16-Oct-2002	sam	Tie new "Fast IPsec" code into the build. This involves the usual configuration stuff as well as conditional code in the IPv4 and IPv6 areas. Everything is conditional on FAST_IPSEC which is mutually exclusive with IPSEC (KAME IPsec implementation). As noted previously, don't use FAST_IPSEC with INET6 at the moment. Reviewed by: KAME, rwatson Approved by: silence Supported by: Vernier Networks
# 105194	15-Oct-2002	sam	Replace aux mbufs with packet tags: o instead of a list of mbufs use a list of m_tag structures a la openbsd o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines o rewrite KAME use of aux mbufs in terms of packet tags o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets o bump __FreeBSD_version so code can be conditionalized o fixup ipfilter's call to ip_output based on __FreeBSD_version Reviewed by: julian, luigi (silent), -arch, -net, darren Approved by: julian, silence from everyone else Obtained from: openbsd (mostly) MFC after: 1 month
# 104774	10-Oct-2002	maxim	Fix IPOPT_TS processing: do not overwrite IP address by timestamp. PR: misc/42121 Submitted by: Praveen Khurjekar <praveen@codito.com> Reviewed by: silence on -net MFC after: 1 month
# 104094	28-Sep-2002	phk	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
# 103553	18-Sep-2002	phk	Use m_fixhdr() rather than roll our own.
# 103479	17-Sep-2002	maxim	Explicitly clear M_FRAG flag on a mbuf with the last fragment to unbreak ip fragments reassembling for loopback interface. Discussed with: bde, jlemon Reviewed by: silence on -net MFC after: 2 weeks
# 101268	03-Aug-2002	luigi	Fix handling of packets which matched an "ipfw fwd" rule on the input side.
# 101239	02-Aug-2002	rwatson	When preserving the IP header in extra mbuf in the IP forwarding case, also preserve the MAC label. Note that this mbuf allocation is fairly non-optimal, but not my fault. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 101095	31-Jul-2002	rwatson	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument the code managing IP fragment reassembly queues (struct ipq) to invoke appropriate MAC entry points to maintain a MAC label on each queue. Permit MAC policies to associate information with a queue based on the mbuf that caused it to be created, update that information based on further mbufs accepted by the queue, influence the decision making process by which mbufs are accepted to the queue, and set the label of the mbuf holding the reassembled datagram following reassembly completion. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 98904	27-Jun-2002	mux	Warning fixes for 64 bits platforms. With this last fix, I can build a GENERIC sparc64 kernel with -Werror. Reviewed by: luigi
# 98701	23-Jun-2002	luigi	Move some global variables in more appropriate places. Add XXX comments to mark places which need to be taken care of if we want to remove this part of the kernel from Giant. Add a comment on a potential performance problem with ip_forward()
# 98666	23-Jun-2002	luigi	fix bad indentation and whitespace resulting from cut&paste
# 98613	22-Jun-2002	luigi	Remove (almost all) global variables that were used to hold packet forwarding state ("annotations") during ip processing. The code is considerably cleaner now. The variables removed by this change are: ip_divert_cookie used by divert sockets ip_fw_fwd_addr used for transparent ip redirection last_pkt used by dynamic pipes in dummynet Removal of the first two has been done by carrying the annotations into volatile structs prepended to the mbuf chains, and adding appropriate code to add/remove annotations in the routines which make use of them, i.e. ip_input(), ip_output(), tcp_input(), bdg_forward(), ether_demux(), ether_output_frame(), div_output(). On passing, remove a bug in divert handling of fragmented packet. Now it is the fragment at offset 0 which sets the divert status of the whole packet, whereas formerly it was the last incoming fragment to decide. Removal of last_pkt required a change in the interface of ip_fw_chk() and dummynet_io(). On passing, use the same mechanism for dummynet annotations and for divert/forward annotations. option IPFIREWALL_FORWARD is effectively useless, the code to implement it is very small and is now in by default to avoid the obfuscation of conditionally compiled code. NOTES: * there is at least one global variable left, sro_fwd, in ip_output(). I am not sure if/how this can be removed. * I have deliberately avoided gratuitous style changes in this commit to avoid cluttering the diffs. Minor style cleanup will likely be necessary. * this commit only focused on the IP layer. I am sure there is a number of global variables used in the TCP and maybe UDP stack. * despite the number of files touched, there are absolutely no API's or data structures changed by this commit (except the interfaces of ip_fw_chk() and dummynet_io(), which are internal anyways), so an MFC is quite safe and unintrusive (and desirable, given the improved readability of the code). MFC after: 10 days
# 97658	31-May-2002	tanimura	Back out my last commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 97074	21-May-2002	arr	- Change the newly turned INVARIANTS #ifdef blocks (they were changed from DIAGNOSTIC yesterday) into KASSERT()'s as these help to increase code readability.
# 97018	20-May-2002	arr	- Turn a #ifdef DIAGNOSTIC to #ifdef INVARIANTS as the code from this line through the #endif is really a sanity check. Reviewed by: jake
# 96972	20-May-2002	tanimura	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# 96432	11-May-2002	dd	s/demon/daemon/
# 96245	09-May-2002	luigi	Cleanup the interface to ip_fw_chk, two of the input arguments were totally useless and have been removed. ip_input.c, ip_output.c: Properly initialize the "ip" pointer in case the firewall does an m_pullup() on the packet. Remove some debugging code forgotten long ago. ip_fw.[ch], bridge.c: Prepare the grounds for matching MAC header fields in bridged packets, so we can have 'etherfw' functionality without a lot of kernel and userland bloat.
# 93818	04-Apr-2002	jhb	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
# 92723	19-Mar-2002	alfred	Remove __P.
# 91271	26-Feb-2002	jedgar	Enforce inbound IPsec SPD Reviewed by: fenner
# 90868	18-Feb-2002	mike	o Move NTOHL() and associated macros into <sys/param.h>. These are deprecated in favor of the POSIX-defined lowercase variants. o Change all occurrences of NTOHL() and associated macros in the source tree to use the lowercase function variants. o Add missing license bits to sparc64's <machine/endian.h>. Approved by: jake o Clean up <machine/endian.h> files. o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>. o Remove prototypes for non-existent bswapXX() functions. o Include <machine/endian.h> in <arpa/inet.h> to define the POSIX-required ntohl() family of functions. o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>, and <sys/param.h>. o Prepend underscores to the ntohl() family to help deal with complexities associated with having MD (asm and inline) versions, and having to prevent exposure of these functions in other headers that happen to make use of endian-specific defines. o Create weak aliases to the canonical function name to help deal with third-party software forgetting to include an appropriate header. o Remove some now unneeded pollution from <sys/types.h>. o Add missing <arpa/inet.h> includes in userland. Tested on: alpha, i386 Reviewed by: bde, jake, tmm
# 89809	26-Jan-2002	cjc	The ipfw(8) 'tee' action simply hasn't worked on incoming packets for some time. _All_ packets, regardless of destination, were accepted by the machine as if addressed to it. Jump back to 'pass' processing for a teed packet instead of falling through as if it was ours. PR: kern/31130 Reviewed by: -net, luigi MFC after: 2 weeks
# 89069	08-Jan-2002	msmith	Initialise the intrq_present fields at runtime, not link time. This allows us to load protocols at runtime, and avoids the use of common variables. Also fix the ip6_intrq assignment so that it works at all.
# 88665	29-Dec-2001	yar	Don't reveal a router in the IPSTEALTH mode through IP options. The following steps are involved: a) the IP options related to routing (LSRR and SSRR) are processed as though the router were a host, b) the other IP options are processed as usual only if the packet is destined for the router; otherwise they are ignored. PR: kern/23123 Discussed in: freebsd-hackers
# 88593	28-Dec-2001	julian	Fix ipfw fwd so that it acts as the docs say when forwarding an incoming packet to another machine. Obtained from: Vicor Production tree MFC after: 3 weeks
# 87915	14-Dec-2001	jlemon	minor style and whitespace fixes.
# 87120	30-Nov-2001	ru	- Make ip_rtaddr() global, and use it to look up the correct source address in icmp_reflect(). - Two new "struct icmpstat" members: icps_badaddr and icps_noroute. PR: kern/31575 Obtained from: BSD/OS MFC after: 1 week
# 86047	04-Nov-2001	luigi	MFS: sync the ipfw/dummynet/bridge code with the one recently merged into stable (mostly, but not only, formatting and comments changes).
# 85467	25-Oct-2001	jlemon	Don't use the ip_timestamp structure to access timestamp options, as the compiler may cause an unaligned access to be generated in some cases. PR: 30982
# 84516	05-Oct-2001	ps	Make it so dummynet and bridge can be loaded as modules. Submitted by: billf
# 84102	29-Sep-2001	jlemon	Add a hash table that contains the list of internet addresses, and use this in place of the in_ifaddr list when appropriate. This improves performance on hosts which have a large number of IP aliases.
# 84101	29-Sep-2001	jlemon	Centralize satosin(), sintosa() and ifatoia() macros in <netinet/in.h> Remove local definitions.
# 84058	27-Sep-2001	luigi	Two main changes here: + implement "limit" rules, which permit to limit the number of sessions between certain host pairs (according to masks). These are a special type of stateful rules, which might be of interest in some cases. See the ipfw manpage for details. + merge the list pointers and ipfw rule descriptors in the kernel, so the code is smaller, faster and more readable. This patch basically consists in replacing "foo->rule->bar" with "rule->bar" all over the place. I have been willing to do this for ages! MFC after: 1 week
# 83934	25-Sep-2001	brooks	Make faith loadable, unloadable, and clonable.
# 83130	06-Sep-2001	jlemon	Wrap array accesses in macros, which also happen to be lvalues: ifnet_addrs[i - 1] -> ifaddr_byindex(i) ifindex2ifnet[i] -> ifnet_byindex(i) This is intended to ease the conversion to SMPng.
# 82884	03-Sep-2001	julian	Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again. Obtained from: Well, KAME I guess.
# 82445	27-Aug-2001	jesper	When net.inet.tcp.icmp_may_rst is enabled, report ECONNREFUSED not ENETRESET to the application as a RST would, this way we're compatible with the most applications. MFC candidate. Submitted by: Scott Renfro <scott@renfro.org> Reviewed by: Mike Silbersack <silby@silby.com>
# 78667	23-Jun-2001	ru	Add netstat(1) knob to reset net.inet.{ip|icmp|tcp|udp|igmp}.stats. For example, ``netstat -s -p ip -z'' will show and reset IP stats. PR: bin/17338
# 78183	13-Jun-2001	ume	This is a forced commit intended to correct the previous log. `(possible) remote kernel panic fix' was already fixed in 1.132. I had some confusion while gathering the log.
# 78089	11-Jun-2001	ume	This is a forced commit to describe the previous commit. - (possible) remote kernel panic fix - out of bounds access on ill-formed ipopt. - strict boundary check on ipopt. - make sure to enforce inbound IPsec policy on all final header. - add missing ipcomp entry from ipprotosw. - 127/8 must not appear on wire - RFC1122. this is rather important as we use weak host model, so outsider can abuse 127.0.0.1 from outside. - introduce ipstat.ips_badaddr - use ipsec_gethist() to prevent packet filters from looking at decapsulated packets. - remove duplicate 127.0.0.0/8 checking.
# 78064	11-Jun-2001	ume	Sync with recent KAME. This work was based on kame-20010528-freebsd43-snap.tgz and some critical problem after the snap was out were fixed. There are many many changes since last KAME merge. TODO: - The definitions of SADB_* in sys/net/pfkeyv2.h are still different from RFC2407/IANA assignment because of binary compatibility issue. It should be fixed under 5-CURRENT. - ip6po_m member of struct ip6_pktopts is no longer used. But, it is still there because of binary compatibility issue. It should be removed under 5-CURRENT. Reviewed by: itojun Obtained from: KAME MFC after: 3 weeks
# 77969	10-Jun-2001	jesper	Make the default value of net.inet.ip.maxfragpackets and net.inet6.ip6.maxfragpackets dependent on nmbclusters, defaulting to nmbclusters / 4 Reviewed by: bde MFC after: 1 week
# 77665	03-Jun-2001	jesper	Prevent denial of service using bogus fragmented IPv4 packets. An attacker sending a lot of bogus fragmented packets to the target (with different IPv4 identification field - ip_id), may be able to put the target machine into mbuf starvation state. By setting an upper limit on the number of reassembly queues we prevent this situation. This upper limit is controlled by the new sysctl net.inet.ip.maxfragpackets which defaults to 200, as in the IPv6 case. This should be sufficient for most systems, but you might want to increase it if you have lots of TCP sessions. I'm working on making the default value dependent on nmbclusters. If you want old behaviour (no upper limit) set this sysctl to a negative value. If you don't want to accept any fragments (not recommended) set the sysctl to 0 (zero). Obtained from: NetBSD MFC after: 1 week
# 77574	01-Jun-2001	kris	Add ``options RANDOM_IP_ID'' which randomizes the ID field of IP packets. This closes a minor information leak which allows a remote observer to determine the rate at which the machine is generating packets, since the default behaviour is to increment a counter for each packet sent. Reviewed by: -net Obtained from: OpenBSD
# 77572	01-Jun-2001	obrien	Back out jesper's 2001/05/31 14:58:11 PDT commit. It does not compile.
# 77545	31-May-2001	jesper	Prevent denial of service using bogus fragmented IPv4 packets. An attacker sending a lot of bogus fragmented packets to the target (with different IPv4 identification field - ip_id), may be able to put the target machine into mbuf starvation state. By setting an upper limit on the number of reassembly queues we prevent this situation. This upper limit is controlled by the new sysctl net.inet.ip.maxfragpackets which defaults to NMBCLUSTERS/4. If you want old behaviour (no upper limit) set this sysctl to a negative value. If you don't want to accept any fragments (not recommended) set the sysctl to 0 (zero). Obtained from: NetBSD (partially) MFC after: 1 week
# 74454	19-Mar-2001	ru	Invalidate cached forwarding route (ipforward_rt) whenever a new route is added to the routing table, otherwise we may end up using the wrong route when forwarding. PR: kern/10778 Reviewed by: silence on -net
# 74415	18-Mar-2001	ru	Make sure the cached forwarding route (ipforward_rt) is still up before using it. Not checking this may have caused the wrong IP address to be used when processing certain IP options (see example below). This also caused the wrong route to be passed to ip_output() when forwarding, but fortunately ip_output() is smart enough to detect this. This example demonstrates the wrong behavior of the Record Route option observed with this bug. Host ``freebsd'' is acting as the gateway for the ``sysv''. 1. On the gateway, we add the route to the destination. The new route will use the primary address of the loopback interface, 127.0.0.1: : freebsd# route add 10.0.0.66 -iface lo0 -reject : add host 10.0.0.66: gateway lo0 2. From the client, we ping the destination. We see the correct replies. Please note that this also causes the relevant route on the ``freebsd'' gateway to be cached in ipforward_rt variable: : sysv# ping -snv 10.0.0.66 : PING 10.0.0.66: 56 data bytes : ICMP Host Unreachable from gateway 192.168.0.115 : ICMP Host Unreachable from gateway 192.168.0.115 : ICMP Host Unreachable from gateway 192.168.0.115 : : ----10.0.0.66 PING Statistics---- : 3 packets transmitted, 0 packets received, 100% packet loss 3. On the gateway, we delete the route to the destination, thus making the destination reachable through the `default' route: : freebsd# route delete 10.0.0.66 : delete host 10.0.0.66 4. From the client, we ping destination again, now with the RR option turned on. The surprise here is the 127.0.0.1 in the first reply. This is caused by the bug in ip_rtaddr() not checking the cached route is still up befor use. The debug code also shows that the wrong (down) route is further passed to ip_output(). 
The latter detects that the route is down, and replaces the bogus route with the valid one, so we see the correct replies (192.168.0.115) on further probes: : sysv# ping -snRv 10.0.0.66 : PING 10.0.0.66: 56 data bytes : 64 bytes from 10.0.0.66: icmp_seq=0. time=10. ms : IP options: <record route> 127.0.0.1, 10.0.0.65, 10.0.0.66, : 192.168.0.65, 192.168.0.115, 192.168.0.120, : 0.0.0.0(Current), 0.0.0.0, 0.0.0.0 : 64 bytes from 10.0.0.66: icmp_seq=1. time=0. ms : IP options: <record route> 192.168.0.115, 10.0.0.65, 10.0.0.66, : 192.168.0.65, 192.168.0.115, 192.168.0.120, : 0.0.0.0(Current), 0.0.0.0, 0.0.0.0 : 64 bytes from 10.0.0.66: icmp_seq=2. time=0. ms : IP options: <record route> 192.168.0.115, 10.0.0.65, 10.0.0.66, : 192.168.0.65, 192.168.0.115, 192.168.0.120, : 0.0.0.0(Current), 0.0.0.0, 0.0.0.0 : : ----10.0.0.66 PING Statistics---- : 3 packets transmitted, 3 packets received, 0% packet loss : round-trip (ms) min/avg/max = 0/3/10
# 74362	16-Mar-2001	phk	<sys/queue.h> makeover.
# 73996	08-Mar-2001	iedowse	It was possible for ip_forward() to supply to icmp_error() an IP header with ip_len in network byte order. For certain values of ip_len, this could cause icmp_error() to write beyond the end of an mbuf, causing mbuf free-list corruption. This problem was observed during generation of ICMP redirects. We now make quite sure that the copy of the IP header kept for icmp_error() is stored in a non-shared mbuf header so that it will not be modified by ip_output(). Also: - Calculate the correct number of bytes that need to be retained for icmp_error(), instead of assuming that 64 is enough (it's not). - In icmp_error(), use m_copydata instead of bcopy() to copy from the supplied mbuf chain, in case the first 8 bytes of IP payload are not stored directly after the IP header. - Sanity-check ip_len in icmp_error(), and panic if it is less than sizeof(struct ip). Incoming packets with bad ip_len values are discarded in ip_input(), so this should only be triggered by bugs in the code, not by bad packets. This patch results from code and suggestions from Ruslan, Bosko, Jonathan Lemon and Matt Dillon, with important testing by Mike Tancsa, who could reproduce this problem at will. Reported by: Mike Tancsa <mike@sentex.net> Reviewed by: ru, bmilekic, jlemon, dillon
# 73791	05-Mar-2001	truckman	Modify the comments to more closely resemble the English language.
# 73626	05-Mar-2001	truckman	Move the loopback net check closer to the beginning of ip_input() so that it doesn't block packets whose destination address has been translated to the loopback net by ipnat. Add warning comments about the ip_checkinterface feature.
# 73402	04-Mar-2001	truckman	Disable interface checking for packets subject to "ipfw fwd". Chris Johnson <cjohnson@palomine.net> tested this fix in -stable.
# 73399	03-Mar-2001	truckman	Disable interface checking when IP forwarding is engaged so that packets addressed to the interface on the other side of the box follow their historical path. Explicitly block packets sent to the loopback network sent from the outside, which is consistent with the behavior of the forwarding path between interfaces as implemented in in_canforward(). Always check the arrival interface when matching the packet destination against the interface broadcast addresses. This bug allowed TCP connections to be made to the broadcast address of an interface on the far side of the system because the M_BCAST flag was not set because the packet was unicast to the interface on the near side. This was broken when the directed broadcast code was removed from revision 1.32. If the directed broadcast code was still present, the destination would not have been recognized as local until the packet was forwarded to the output interface and ether_output() looped a copy back to ip_input() with M_BCAST set and the receive interface set to the output interface. Optimize the order of the tests. Reviewed by: jlemon
# 73357	02-Mar-2001	jlemon	Add a new sysctl net.inet.ip.check_interface, which will verify that an incoming packet arrives on an interface that has an address matching the packet's address. This is turned on by default.
# 73172	27-Feb-2001	jlemon	When iterating over our list of interface addresses in order to determine if an arriving packet belongs to us, also check that the packet arrived through the correct interface. Skip this check if the packet was locally generated.
# 72959	23-Feb-2001	jlemon	Allow ICMP unreachables which map into PRC_UNREACH_ADMIN_PROHIB to reset TCP connections which are in the SYN_SENT state, if the sequence number in the echoed ICMP reply is correct. This behavior can be controlled by the sysctl net.inet.tcp.icmp_may_rst. Currently, only subtypes 2,3,10,11,12 are treated as such (port, protocol and administrative unreachables). Associate an error code with these resets which is reported to the user application: ENETRESET. Disallow resetting TCP sessions which are not in a SYN_SENT state. Reviewed by: jesper, -net
# 72803	21-Feb-2001	jesper	Backout change in 1.153, as it violates RFC 1122 section 3.2.1.3. Requested by: jlemon,ru
# 72775	20-Feb-2001	jesper	Send an ICMP unreachable instead of silently dropping the packet if we receive a packet not for us and forwarding is disabled. PR: kern/24512 Reviewed by: jlemon Approved by: jlemon
# 72012	04-Feb-2001	phk	Another round of the <sys/queue.h> FOREACH transmogriffer. Created with: sed(1) Reviewed by: md5(1)
# 71999	04-Feb-2001	phk	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
# 71909	01-Feb-2001	luigi	MFS: bridge/ipfw/dummynet fixes (bridge.c will be committed separately)
# 69152	25-Nov-2000	jlemon	Lock down the network interface queues. The queue mutex must be obtained before adding/removing packets from the queue. Also, the if_obytes and if_omcasts fields should only be manipulated under protection of the mutex. IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on the queue. An IF_LOCK macro is provided, as well as the old (mutex-less) versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which needs them, but their use is discouraged. Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF, which takes care of locking/enqueue, and also statistics updating/start if necessary.
# 68169	01-Nov-2000	ru	Wrong checksum used for certain reassembled IP packets before diverting.
# 67708	27-Oct-2000	phk	Convert all users of fldoff() to offsetof(). fldoff() is bad because it only takes a struct tag which makes it impossible to use unions, typedefs etc. Define __offsetof() in <machine/ansi.h> Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h> Remove myriad of local offsetof() definitions. Remove includes of <stddef.h> in kernel code. NB: Kernel code should never include from /usr/include ! Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API. Deprecate <struct.h> with a warning. The warning turns into an error on 01-12-2000 and the file gets removed entirely on 01-01-2001. Partial reviews by: various. Significant brucifications by: bde
# 67620	26-Oct-2000	ru	RFC 791 says that IP_RF bit should always be zero, but nothing in the code enforces this. So, do not check for and attempt a false reassembly if only IP_RF is set. Also, removed the dead code, since we no longer use dtom() on return from ip_reass().
# 67609	26-Oct-2000	ru	Wrong header length used for certain reassembled IP packets. This was first fixed in rev 1.82 but then broken in rev 1.125. PR: 6177
# 67334	19-Oct-2000	joe	Augment the 'ifaddr' structure with a 'struct if_data' to keep statistics on a per network address basis. Teach the IPv4 and IPv6 input/output routines to log packets/bytes against the network address connected to the flow. Teach netstat to display the per-address stats for IP protocols when 'netstat -i' is invoked, instead of displaying the per-interface stats.
# 67026	12-Oct-2000	ru	Backout my wrong attempt to fix the compilation warning in ip_input.c and instead reapply the revision 1.49 of mbuf.h, i.e. Fixed regression of the type of the `header' member of struct pkthdr from `void *' to caddr_t in rev.1.51. This mainly caused an annoying warning for compiling ip_input.c. Requested by: bde
# 67009	12-Oct-2000	ru	Fix the compilation warning.
# 65859	14-Sep-2000	jlemon	m_cat() can free its second argument, so collect the checksum information from the fragment before calling m_cat().
# 65837	14-Sep-2000	ru	Follow BSD/OS and NetBSD, keep the ip_id field in network order all the time. Requested by: wollman
# 65327	01-Sep-2000	ru	Fixed broken ICMP error generation, unified conversion of IP header fields between host and network byte order. The details: o icmp_error() now does not add IP header length. This fixes the problem when icmp_error() is called from ip_forward(). In this case the ip_len of the original IP datagram returned with ICMP error was wrong. o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host byte order, so DTRT and convert these fields back to network byte order before sending a message. This fixes the problem described in PR 16240 and PR 20877 (ip_id field was returned in host byte order). o ip_ttl decrement operation in ip_forward() was moved down to make sure that it does not corrupt the copy of original IP datagram passed later to icmp_error(). o A copy of original IP datagram in ip_forward() was made a read-write, independent copy. This fixes the problem I first reported to Garrett Wollman and Bill Fenner and later put in audit trail of PR 16240: ip_output() (not always) converts fields of original datagram to network byte order, but because copy (mcopy) and its original (m) most likely share the same mbuf cluster, ip_output()'s manipulations on original also corrupted the copy. o ip_output() now expects all three fields, ip_len, ip_off and (what is significant) ip_id in host byte order. It was a headache for years that ip_id was handled differently. The only compatibility issue here is the raw IP socket interface with IP_HDRINCL socket option set and a non-zero ip_id field, but ip.4 manual page was unclear on whether in this case ip_id field should be in host or network byte order.
# 64075	31-Jul-2000	ache	Nonexistent <sys/pfil.h> -> <net/pfil.h> Kernel 'make depend' fails otherwise
# 64060	31-Jul-2000	darrenr	activate pfil_hooks and convert ipfilter to use it
# 62587	04-Jul-2000	itojun	sync with kame tree as of july00. tons of bug fixes/improvements. API changes: - additional IPv6 ioctls - IPsec PF_KEY API was changed, it is mandatory to upgrade setkey(8). (also syntax change)
# 61183	02-Jun-2000	jlemon	Add boundary checks against IP options. Obtained from: OpenBSD
# 60661	17-May-2000	jlemon	Cast sizeof() calls to be of type (int) when they appear in a signed integer expression. Otherwise the sizeof() call will force the expression to be evaluated as unsigned, which is not the intended behavior. Obtained from: NetBSD (in a different form)
# 60612	15-May-2000	ru	Do not call icmp_error() if ipfirewall(4) denied packet. PR: kern/10747, kern/18382
# 60304	09-May-2000	itojun	correct more out-of-bounds memory access, if cnt == 1 and optlen > 1. similar to recent fix to sys/netinet/ipf.c (by darren).
# 58698	27-Mar-2000	jlemon	Add support for offloading IP/TCP/UDP checksums to NIC hardware which supports them.
# 57401	23-Feb-2000	guido	Remove option IPFILTER_KLD. In case you wanted to kldload ipfilter, the module would only work in kernels built with this option. Approved by: jkh
# 57178	13-Feb-2000	peter	Clean up some loose ends in the network code, including the X.25 and ISO #ifdefs. Clean out unused netisr's and leftover netisr linker set gunk. Tested on x86 and alpha, including world. Approved by: jkh
# 57117	10-Feb-2000	luigi	Move definition of fw_enable from ip_fw.c to ip_input.c so we can compile kernels without IPFIREWALL . Reported-by: Robert Watson Approved-by: jordan
# 57114	10-Feb-2000	luigi	Support the net.inet.ip.fw.enable variable, part of the recent ipfw modifications. Approved-by: jordan
# 56555	24-Jan-2000	brian	Move the intrq variables into net/intrq.c and unconditionally include this in all kernels. Declare some const intrq_present variables that can be checked by a module prior to using *intrq to queue data. Make the if_tun module capable of processing atm, ip, ip6, ipx, natm and netatalk packets when TUNSIFHEAD is ioctl()d on. Review not required by: freebsd-hackers
# 55009	22-Dec-1999	shin	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 54221	06-Dec-1999	guido	Revive mlfk_ipl here. This version is slightly changed from the old one: an unnecessary define (KLD_MODULE) has been deleted and the initialisation of the module is done after domaininit was called to be sure inet is running. Some slight changes were made to ip_auth.c and ip_state.c in order to assure including of sys/systm.h in case we make a kld. Make sure ip_fil does not include osreldate in kernel mode. Remove mlfk_ipl.c from here: no sources allowed in these directories!
# 54175	05-Dec-1999	archie	Miscellaneous fixes/cleanups relating to ipfw and divert(4): - Implement 'ipfw tee' (finally) - Divert packets by calling new function divert_packet() directly instead of going through protosw[]. - Replace kludgey global variable 'ip_divert_port' with a function parameter to divert_packet() - Replace kludgey global variable 'frag_divert_port' with a function parameter to ip_reass() - style(9) fixes Reviewed by: julian, green
# 50561	29-Aug-1999	des	Include the correct header for the IPSTEALTH option.
# 50477	27-Aug-1999	peter	$Id$ -> $FreeBSD$
# 47546	27-May-1999	dg	Made net.inet.ip.intr_queue_maxlen writeable.
# 46420	04-May-1999	luigi	Free the dummynet descriptor in ip_dummynet, not in the called routines. The descriptor contains parameters which could be used within those routines (e.g. ip_output()). In passing, add IPPROTO_PGM entry to netinet/in.h
# 46381	03-May-1999	billf	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 45869	20-Apr-1999	peter	Tidy up some stray / unused stuff in the IPFW package and friends. - unifdef -DCOMPAT_IPFW (this was on by default already) - remove traces of in-kernel ip_nat package, it was never committed. - Make IPFW and DUMMYNET initialize themselves rather than depend on compiled-in hooks in ip_init(). This means they initialize the same way both in-kernel and as kld modules. (IPFW initializes now :-)
# 44677	11-Mar-1999	julian	Fix the 'fwd' option to ipfw when asked to divert to another machine. also rely less on other modules clearing static values, and clear them in a few cases we missed before. Submitted by: Matthew Reimer <mreimer@vpop.net>
# 44219	22-Feb-1999	des	Add support for stealth forwarding (forwarding packets without touching their ttl). This can be used - in combination with the proper ipfw incantations - to make a firewall or router invisible to traceroute and other exploration tools. This behaviour is controlled by a sysctl variable (net.inet.ip.stealth) and hidden behind a kernel option (IPSTEALTH). Reviewed by: eivind, bde
# 43802	09-Feb-1999	wollman	After wading in the cesspool of ip_input for an hour, I have managed to convince myself that nothing will break if we permit IP input while interface addresses are unconfigured. (At worst, they will hit some ULP's PCB scan and fail if nobody is listening.) So, remove the restriction that addresses must be configured before packets can be input. Assume that any unicast packet we receive while unconfigured is potentially ours.
# 43305	27-Jan-1999	dillon	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 43066	22-Jan-1999	wollman	Don't forward unicast packets received via link-layer multicast. Suggested by: fenner Original complaint: Shiva Shenoy <Shiva.Shenoy@yagosys.com>
# 42574	12-Jan-1999	eivind	Add #ifdef's to avoid unused label warning in some cases.
# 41993	21-Dec-1998	luigi	Recover from previous dummynet screwup
# 41793	14-Dec-1998	luigi	Last bits (i think) of dummynet for -current.
# 41591	07-Dec-1998	archie	The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
# 41201	16-Nov-1998	dfr	Make the previous fix more portable. Requested by: bde
# 41177	15-Nov-1998	dfr	Fix printf format errors on alpha.
# 41096	11-Nov-1998	dg	Be sure to pullup entire IP header when dealing with fragment packets.
# 40670	27-Oct-1998	dfr	Some optimisations to the fragment reassembly code. Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>
# 40669	27-Oct-1998	dfr	Fix a bug in the new fragment reassembly code which was tickled by receiving a fragment which wholly overlapped one or more existing fragments. Submitted by: Don Lewis <Don.Lewis@tsc.tdk.com>
# 40435	16-Oct-1998	peter	gulp. Jordan specifically OK'ed this.. This is the bulk of the support for doing kld modules. Two linker_sets were replaced by SYSINIT()'s. VFS's and exec handlers are self registered. kld is now a superset of lkm. I have converted most of them, they will follow as a separate commit as samples. This all still works as a static a.out kernel using LKM's.
# 39043	10-Sep-1998	dfr	Ensure that m_nextpkt is set to NULL after reassembling fragments.
# 38513	24-Aug-1998	dfr	Re-implement tcp and ip fragment reassembly to not store pointers in the ip header which can't work on alpha since pointers are too big. Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
# 38482	23-Aug-1998	wollman	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.
# 38373	16-Aug-1998	bde	Fixed printf format errors.
# 37624	13-Jul-1998	bde	Fixed some longs that should have been fixed-sized types.
# 37498	08-Jul-1998	dg	When not acting as a router (ipforwarding=0), silently discard source routed packets that aren't destined for us, as required by RFC-1122. PR: 7191
# 37434	06-Jul-1998	julian	oops ended comment before the comment ended..
# 37433	06-Jul-1998	julian	Bring back some slight cleanups from 2.2
# 37412	06-Jul-1998	julian	Fix braino in switching to TAILQ macro.
# 37409	06-Jul-1998	julian	Support for IPFW based transparent forwarding. Any packet that can be matched by a ipfw rule can be redirected transparently to another port or machine. Redirection to another port mostly makes sense with tcp, where a session can be set up between a proxy and an unsuspecting client. Redirection to another machine requires that the other machine also be expecting to receive the forwarded packets, as their headers will not have been modified. /sbin/ipfw must be recompiled!!! Reviewed by: Peter Wemm <peter@freebsd.org> Submitted by: Chrisy Luke <chrisy@flix.net>
# 37332	02-Jul-1998	julian	Remove the option to keep IPFW diversion backwards compatible WRT diversion reinjection. No-one has been bitten by the new behaviour that I know of.
# 36908	12-Jun-1998	julian	Go through the loopback code with a broom.. Remove lots'o'hacks. looutput is now static. Other callers who want to use loopback to allow shortcutting should call the special entrypoint for this, if_simloop(), which is specifically designed for this purpose. Using looutput for this purpose was problematic, particularly with bpf and trying to keep track of whether one should be using the characteristics of the loopback interface or the interface (e.g. if_ethersubr.c) that was requesting the loopback. There was a whole class of errors due to this mis-use, each of which had hacks to cover them up. Consists largely of hack removal :-)
# 36710	06-Jun-1998	julian	Make sure the default value of a dummy variable is 0 so that it doesn't do anything.
# 36708	06-Jun-1998	julian	Fix wrong data type for a pointer.
# 36707	06-Jun-1998	julian	clean up the changes made to ipfw over the last weeks (should make the ipfw lkm work again)
# 36678	05-Jun-1998	julian	Reverse the default sense of the IPFW/DIVERT reinjection code so that the new behaviour is now default. Solves the "infinite loop in diversion" problem when more than one diversion is active. Man page changes follow. The new code is in -stable as the NON default option.
# 36369	25-May-1998	julian	Add optional code to change the way that divert and ipfw work together. Prior to this change, Accidental recursion protection was done by the diverted daemon feeding back the divert port number it got the packet on, as the port number on a sendto(). IPFW knew not to redivert a packet to this port (again). Processing of the ruleset started at the beginning again, skipping that divert port. The new semantic (which is how we should have done it the first time) is that the port number in the sendto() is the rule number AFTER which processing should restart, and on a recvfrom(), the port number is the rule number which caused the diversion. This is much more flexible, and also more intuitive. If the user uses the same sockaddr received when resending, processing resumes at the rule number following that that caused the diversion. The user can however select to resume rule processing at any rule. (0 is restart at the beginning) To enable the new code use option IPFW_DIVERT_RESTART This should become the default as soon as people have looked at it a bit
# 36330	24-May-1998	dg	The ipt_ptr field is 1-based (see TCP/IP Illustrated, Vol. 1, pp. 91-95), so it must be adjusted (minus 1) before using it to do the length check. I'm not sure who to give the credit to, but the bug was reported by Jennifer Dawn Myers <jdm@enteract.com>, who also supplied a patch. It was also fixed in OpenBSD previously by andreas.gunnarsson@emw.ericsson.se, and of course I did the homework to verify that the fix was correct per the specification. PR: 6738
# 36192	19-May-1998	dg	Added fast IP forwarding code by Matt Thomas <matt@3am-software.com> via NetBSD, ported to FreeBSD by Pierre Beyssac <pb@fasterix.freenix.org> and minorly tweaked by me. This is a standard part of FreeBSD, but must be enabled with: "sysctl -w net.inet.ip.fastforwarding=1" ...and of course forwarding must also be enabled. This should probably be modified to use the zone allocator for speed and space efficiency. The current algorithm also appears to lose if the number of active paths exceeds IPFLOW_MAX (256), in which case it wastes lots of time trying to figure out which cache entry to drop.
# 35174	13-Apr-1998	phk	Wrong header length used for certain reassembled IP packets. PR: 6177 Reviewed by: phk, wollman Submitted by: Eric Sprinkle <eric@ennovatenetworks.com>
# 34961	30-Mar-1998	phk	Eradicate the variable "time" from the kernel, using various measures. "time" wasn't an atomic variable, so splfoo() protection was needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now use the new variable time_second instead. gettime() changed to getmicrotime(). Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeared time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passed &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
# 34746	21-Mar-1998	peter	Make this compile.. There are some unpleasing hacks in here. A major unifdef session is sorely tempting but would destroy any remaining chance of tracking the original sources.
# 33851	26-Feb-1998	dima	NetBSD PR# 2772 Reviewed by: David Greenman
# 33440	16-Feb-1998	guido	Add new sysctl variable: net.inet.ip.accept_sourceroute. It controls whether the system is to accept source routed packets. It used to be such that, no matter the setting of net.inet.ip.sourceroute, source routed packets destined at us would be accepted. Now it is controllable, with the default set to NOT accept those.
# 33268	12-Feb-1998	ache	Replace non-existent ip_forwarding with ipforwarding (compilation error)
# 33249	11-Feb-1998	guido	Only forward source routed packets when ip_forwarding is set to 1. This means that a FreeBSD will only forward source routed packets when both net.inet.ip.forwarding and net.inet.ip.sourceroute are set to 1. You can hit me now ;-) Submitted by: Thomas Ptacek
# 33134	06-Feb-1998	eivind	Back out DIAGNOSTIC changes.
# 33108	04-Feb-1998	eivind	Turn DIAGNOSTIC into a new-style option.
# 32358	09-Jan-1998	eivind	Make the BOOTP family new-style options (in opt_bootp.h)
# 31163	13-Nov-1997	julian	Submitted by: Archie Cobbs (IPDIVERT author) close small security hole where an attacker could send packets with IPDIVERT protocol, and select how it would be diverted thus bypassing the ipfirewall. Discovered by inspection rather than attack. (you'd have to know how the firewall was configured (EXACTLY) to make use of this but..)
# 30966	05-Nov-1997	joerg	Make IPDIVERT a supported option. Alas, in_var.h depends on it, i hope i've found out all files that actually depend on this dependency. IMHO, it's not very good practice to change the size of internal structs depending on kernel options.
# 30948	05-Nov-1997	julian	Return the entire if info, rather than just the index number. (at least try) Interface index numbers are an abomination that should go away (at least in that form)
# 30816	28-Oct-1997	guido	Fix bugs from my previous commit Submitted by: Bruce Evans
# 30813	28-Oct-1997	bde	Removed unused #includes.
# 30790	27-Oct-1997	guido	When dosourcerouting is set do not sourceroute....
# 29838	24-Sep-1997	wollman	Export ipstat via sysctl. Don't understand why this wasn't done before.
# 29480	15-Sep-1997	ache	Prevent overflow with fragmented packets Reviewed by: wollman
# 27669	25-Jul-1997	brian	Recalculate ip_sum before passing a re-assembled packet to a divert port. Pointed-out by: Ari Suutari <ari@suutari.iki.fi>
# 26359	02-Jun-1997	julian	Submitted by: Whistle Communications (archie Cobbs) these are quite extensive additions to the ipfw code. they include a change to the API because the old method was broken, but the user view is kept the same. The new code allows a particular match to skip forward to a particular line number, so that blocks of rules can be used without checking all the intervening rules. There are also many more ways of rejecting connections especially TCP related, and many many more ... see the man page for a complete description.
# 25723	11-May-1997	tegge	Bring in some kernel bootp support. This removes the need for netboot to fill in the nfs_diskless structure, at the cost of some kernel bloat. The advantage is that this code works on a wider range of network adapters than netboot. Several new kernel options are documented in LINT. Obtained from: parts of the code comes from NetBSD.
# 24590	03-Apr-1997	darrenr	Resolve conflicts created by import.
# 22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 22927	19-Feb-1997	darrenr	change IP Filter hooks to match new 3.1.8 patches for FreeBSD
# 22531	10-Feb-1997	darrenr	Add IP Filter hooks (from patches).
# 22333	06-Feb-1997	brian	Don't zero ip->ip_sum during sum validation. This should only affect programs that sit on top of divert(4) sockets. The multicast routing code already unconditionally zeros the sum before recalculating. Any code that unconditionally sums a packet without first zeroing the sum (assuming that it's already zero'd) will break. No such code seems to exist.
# 22212	02-Feb-1997	brian	Reset ip_divert_ignore to zero immediately after use - also, set it in the first place, independent of whether sin->sin_port is set. The result is that diverted packets that are being forwarded will be diverted once and only once on the way in (ip_input()) and again, once and only once on the way out (ip_output()) - twice in total. ICMP packets that don't contain a port will now also be diverted.
# 21932	21-Jan-1997	wollman	Count multicast packets received for groups of which we are not a member separately from generic ``can't forward'' packets. This would have helped me find the previous bug much faster.
# 21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 20407	13-Dec-1996	wollman	Convert the interface address and IP interface address structures to TAILQs. Fix places which referenced these for no good reason that I can see (the references remain, but were fixed to compile again; they are still questionable).
# 20308	11-Dec-1996	dg	Only pay attention to the offset and the IP_MF flag in ip_off. Pointed out by Nathaniel D. Daw (daw@panix.com), but fixed differently by me.
# 19622	11-Nov-1996	fenner	Add the IP_RECVIF socket option, which supplies a packet's incoming interface using a sockaddr_dl. Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR) to work for multicast UDP and raw sockets as well. (They previously only worked for unicast UDP).
# 19183	25-Oct-1996	fenner	Don't allow reassembly to create packets bigger than IP_MAXPACKET, and count attempts to do so. Don't allow users to source packets bigger than IP_MAXPACKET. Make UDP length and ipovly's protocol length unsigned short. Reviewed by: wollman Submitted by: (partly by) kml@nas.nasa.gov (Kevin Lahey)
# 19113	22-Oct-1996	sos	Changed args to the nat functions.
# 18797	07-Oct-1996	wollman	All three files: make COMPAT_IPFW==0 case work again. ip_input.c: - delete some dusty code - _IP_VHL - use fast inline header checksum when possible
# 18160	08-Sep-1996	dg	Dequeue mbuf before freeing it. Fixes mbuf leak and a potential crash when handling IP fragments. Submitted by: Darren Reed <avalon@coombs.anu.edu.au>
# 17758	21-Aug-1996	sos	Add hooks for an IP NAT module, much like the firewall stuff... Move the sockopt definitions for the firewall code from ip_fw.h to in.h where it belongs.
# 17072	10-Jul-1996	julian	Adding changes to ipfw and the kernel to support ip packet diversion.. This stuff should not be too destructive if the IPDIVERT is not compiled in.. be aware that this changes the size of the ip_fw struct so ipfw needs to be recompiled to use it.. more changes coming to clean this up.
# 16333	12-Jun-1996	gpalmer	Convert ipfw to use opt_ipfw.h
# 16206	08-Jun-1996	bde	Changed some memcpy()'s back to bcopy()'s. gcc only inlines memcpy()'s whose count is constant and didn't inline these. I want memcpy() in the kernel go away so that it's obvious that it doesn't need to be optimized. Now it is only used for one struct copy in si.c.
# 15680	08-May-1996	gpalmer	Clean up various compiler warnings. Most (if not all) were benign Reviewed by: bde
# 15211	12-Apr-1996	phk	Fix a bogon I introduced with my last change. Submitted by: Andreas Klemm <andreas@knobel.gun.de>
# 15026	03-Apr-1996	phk	Add feature for tcp "established". Change interface between netinet and ip_fw to be more general, and thus hopefully also support other ip filtering implementations.
# 14817	25-Mar-1996	phk	Check the validity of ia->ia_ifp before we dereference it.
# 14232	24-Feb-1996	phk	Make getsockopt() capable of handling more than one mbuf worth of data. Use this to read rules out of ipfw. Add the lkm code to ipfw.c
# 14230	23-Feb-1996	phk	The new firewall functionality: Filter on the direction (in/out). Filter on fragment/not fragment.
# 14209	23-Feb-1996	phk	Big sweep over the IPFIREWALL and IPACCT code. Close the ip-fragment hole. Waste less memory. Rewrite to contemporary more readable style. Kill separate IPACCT facility, use "accept" rules in IPFIREWALL. Filter incoming >and< outgoing packets. Replace "policy" by sticky "deny all" rule. Rules have numbers used for ordering and deletion. Remove "reorder" code entirely. Count packet & bytecount matches for rules. Code in -current & -stable is now the same.
# 13929	05-Feb-1996	wollman	Provide a direct entry point for IP input. This actually results in a slight decrease in performance, but will lead to better performance later.
# 13266	05-Jan-1996	wollman	Finally demolished the last, tottering remnants of GATEWAY. If you want to enable IP forwarding, use sysctl(8). Also did the same for IPX, which involved inventing a completely new MIB from whole cloth (which I may not quite have correct); be aware of this if you use IPX forwarding. (The two should never have been controlled by the same option anyway.)
# 12955	21-Dec-1995	wollman	Delete old-style-broadcast-address compatibility cruft in IP input path. If users want to use the old-style broadcast addresses, they will have to correctly configure their systems.
# 12940	20-Dec-1995	wollman	Demolish DIRECTED_BROADCAST. It was always a bad idea, and nobody uses it.
# 12933	19-Dec-1995	wollman	Actually call in_rtqdrain() as was originally intended.
# 12820	14-Dec-1995	phk	Another mega commit to staticize things.
# 12657	06-Dec-1995	bde	Removed unnecessary #includes of vm stuff. Most of them were once prerequisites for <sys/sysctl.h>. subr_prof.c: Also replaced #include of <sys/user.h> by #include of <sys/resourcevar.h>.
# 12296	14-Nov-1995	phk	New style sysctl & staticize a lot of stuff.
# 12003	01-Nov-1995	wollman	Instrument the IP input queue with two new read-only MIB entries: net.inet.ip.intr-queue-maxlen (=== ipintrq.ifq_maxlen) and net.inet.ip.intr-queue-drops (=== ipintrq.ifq_drops) There should probably be a standard way of getting the same information going the other way.
# 9575	18-Jul-1995	peter	Change the compile-time option of DIRECTED_BROADCAST into a sysctl variable underneath ip, "directed-broadcast". Reviewed by: David Greenman Obtained from: NetBSD, by Darren Reed.
# 9460	09-Jul-1995	dg	Fixed panic that occurs on certain firewall rejected packets that was caused by dtom() being used on an mbuf cluster. The fix involves passing around the mbuf pointer. Submitted by: Bill Fenner
# 9338	27-Jun-1995	guido	reject option in ip_fw used to panic the system. This fixes it. -Guido
# 9209	13-Jun-1995	wollman	Kernel side of 3.5 multicast routing code, based on work by Bill Fenner and other work done here. The LKM support is probably broken, but it still compiles and will be fixed later.
# 8876	30-May-1995	rgrimes	Remove trailing whitespace.
# 8426	10-May-1995	wollman	Make networking domains drop-ins, through the magic of GNU ld. (Some day, there may even be LKMs.) Also, change the internal name of `unixdomain' to `localdomain' since AF_LOCAL is now the preferred name of this family. Declare netisr correctly and in the right place.
# 8384	09-May-1995	dg	Replaced some bcopy()'s with memcpy()'s so that gcc will inline/optimize.
# 7091	16-Mar-1995	wollman	Reject source routes unless configured on by administrator.
# 7090	16-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 6399	14-Feb-1995	wollman	Attempt to make the host route cache a bit smarter under conditions of high load: 1) If there ever get to be more than net.inet.ip.rtmaxcache entries in the cache, in_rtqtimo() will reduce net.inet.ip.rtexpire by 1/3 and do another round, unless net.inet.ip.rtexpire is less than net.inet.ip.rtminexpire, and never more than once in ten minutes (rtq_timeout). 2) If net.inet.ip.rtexpire is set to zero, don't bother to cache anything.
# 6237	07-Feb-1995	gpalmer	Remove a possible loophole - previously the code wouldn't pass packets destined to the loopback address to the packet filter. Reviewed by: "Ugen J.S.Antsilevich" <ugen@netvision.net.il>
# 5543	12-Jan-1995	ugen	Actual firewall change. 1) Firewall is not subdivided on forwarding / blocking chains anymore. Actually only one chain left - it was the blocking one. 2) LKM support. ip_fwdef.c is function pointers definition and goes into kernel along with all INET stuff.
# 5109	14-Dec-1994	wollman	Make rtq_reallyold user-configurable via sysctl.
# 5105	13-Dec-1994	wollman	Call rtalloc_ign() so that protocol cloning will not occur at the IP layer.
# 5085	12-Dec-1994	ugen	Add match by interface from which packet arrived (via) Handle right fragmented packets. Remove checking option from kernel..
# 4523	16-Nov-1994	jkh	Ugen J.S.Antsilevich's latest, happiest, IP firewall code. Poul: Please take this into BETA. It's non-intrusive, and a rather substantial improvement over what was there before.
# 4277	08-Nov-1994	jkh	Almost 12th hour (the 11th hour was almost an hour ago :-) patches from Ugen.
# 3969	28-Oct-1994	jkh	IP Firewall code from Daniel Boulet and J.S.Antsilevich Submitted by: danny ugen
# 3497	10-Oct-1994	phk	Cosmetics. Silence gcc -Wall.
# 3311	02-Oct-1994	phk	GCC cleanup.
# 2754	14-Sep-1994	wollman	Shuffle some functions and variables around to make it possible for multicast routing to be implemented as an LKM. (There's still a bit of work to do in this area.)
# 2531	06-Sep-1994	wollman	Initial get-the-easy-case-working upgrade of the multicast code to something more recent than the ancient 1.2 release contained in 4.4. This code has the following advantages as compared to previous versions (culled from the README file for the SunOS release): - True multicast delivery - Configurable rate-limiting of forwarded multicast traffic on each physical interface or tunnel, using a token-bucket limiter. - Simplistic classification of packets for prioritized dropping. - Administrative scoping of multicast address ranges. - Faster detection of hosts leaving groups. - Support for multicast traceroute (code not yet available). - Support for RSVP, the Resource Reservation Protocol. What still needs to be done: - The multicast forwarder needs testing. - The multicast routing daemon needs to be ported. - Network interface drivers need to have the `#ifdef MULTICAST' goop ripped out of them. - The IGMP code should probably be bogon-tested. Some notes about the porting process: In some cases, the Berkeley people decided to incorporate functionality from later releases of the multicast code, but then had to do things differently. As a result, if you look at Deering's patches, and then look at our code, it is not always obvious whether the patch even applies. Let the reader beware. I ran ip_mroute.c through several passes of `unifdef' to get rid of useless grot, and to permanently enable the RSVP support, which we will include as standard. Ported by: Garrett Wollman Submitted by: Steve Deering and Ajit Thyagarajan (among others)
# 2112	18-Aug-1994	wollman	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 1817	02-Aug-1994	dg	Added $Id$
# 1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# 1542	24-May-1994	rgrimes	This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
# 1541	24-May-1994	rgrimes	BSD 4.4 Lite Kernel Sources