History log of /freebsd-current/sys/net/route.c
Revision	Date	Author	Comments
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 2ff63af9	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .h pattern Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/
# 19e43c16	27-Mar-2023	Alexander V. Chernikov <melifaro@FreeBSD.org>	netlink: add netlink KPI to the kernel by default This change does the following: Base Netlink KPIs (ability to register the family, parse and/or write a Netlink message) are always present in the kernel. Specifically, * Implementation of genetlink family/group registration/removal, some base accessors (netlink_generic_kpi.c, 260 LoC) are compiled in unconditionally. * Basic TLV parser functions (netlink_message_parser.c, 507 LoC) are compiled in unconditionally. * Glue functions (netlink<>rtsock), malloc/core sysctl definitions (netlink_glue.c, 259 LoC) are compiled in unconditionally. * The rest of the KPI _functions_ are defined in the netlink_glue.c, but their implementation calls a pointer to either the stub function or the actual function, depending on whether the module is loaded or not. This approach allows to have only 1k LoC out of ~3.7k LoC (current sys/netlink implementation) in the kernel, which will not grow further. It also allows for the generic netlink kernel customers to load successfully without requiring Netlink module and operate correctly once Netlink module is loaded. Reviewed by: imp MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D39269
# 2c2b37ad	13-Jan-2023	Justin Hibbits <jhibbits@FreeBSD.org>	ifnet/API: Move struct ifnet definition to a <net/if_private.h> Hide the ifnet structure definition, no user serviceable parts inside, it's a netstack implementation detail. Include it temporarily in <net/if_var.h> until all drivers are updated to use the accessors exclusively. Reviewed by: glebius Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D38046
# 3636a967	15-Dec-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	route: allow RTM_CHANGE notifications in rt_routemsg(). MFC after: 2 weeks
# 1bcd230f	03-Dec-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	netlink: add interface notification on link status / flags change. * Add link-state change notifications by subscribing to ifnet_link_event. In the Linux netlink model, link state is reported in 2 places: first is the IFLA_OPERSTATE, which stores state per RFC2863. The second is an IFF_LOWER_UP interface flag. As many applications rely on the latter, reserve 1 bit from if_flags, named as IFF_NETLINK_1. This flag is mapped to IFF_LOWER_UP in the netlink headers. This is done to avoid making applications think this flag is actually supported / presented in non-netlink outputs. * Add flag change notifications, by hooking into rt_ifmsg(). In the netlink model, notification should include the bitmask for the change flags. Update rt_ifmsg() to include such bitmask. Differential Revision: https://reviews.freebsd.org/D37597
# 7e5bf684	20-Jan-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	netlink: add netlink support Netlink is a communication protocol currently used in the Linux kernel to modify, read and subscribe for nearly all networking state. Interfaces, addresses, routes, firewall, fibs, vnets, etc are controlled via netlink. It is an async, TLV-based protocol, providing 1-1 and 1-many communications. The current implementation supports the subset of NETLINK_ROUTE family. To be more specific, the following is supported: * Dumps: - routes - nexthops / nexthop groups - interfaces - interface addresses - neighbors (arp/ndp) * Notifications: - interface arrival/departure - interface address arrival/departure - route addition/deletion * Modifications: - adding/deleting routes - adding/deleting nexthops/nexthops groups - adding/deleting neighbors - adding/deleting interfaces (basic support only) * Rtsock interaction - route events are bridged both ways The implementation also supports the NETLINK_GENERIC family framework. Implementation notes: Netlink is implemented via loadable/unloadable kernel module, not touching many kernel parts. Each netlink socket uses dedicated taskqueue to support async operations that can sleep, such as interface creation. All message processing is performed within these taskqueues. Compatibility: Most of the Netlink data models specified above map to FreeBSD concepts nicely. Unmodified ip(8) binary correctly works with interfaces, addresses, routes, nexthops and nexthop groups. Some software such as net/bird require header-only modifications to compile and work with FreeBSD netlink. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D36002 MFC after: 2 months
# 000250be	08-Sep-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: add ability to set the protocol that installed route/nexthop. Routing daemons such as bird need to know if they install certain route so they can clean it up on startup, as a form of achieving consistent state during the crash recovery. Currently they use combination of routing flags (RTF_PROTO1) to detect these routes when interacting via route(4) rtsock protocol. Netlink protocol has a special "rtm_protocol" field that is filled and checked by the route originator. To prepare for the upcoming netlink introduction, add ability to record origin to both nexthops and nexthop groups via <nhop|nhgrp>_<get|set>_origin() KPI. The actual calls will be used in the followup commits. MFC after: 1 month
# 6d4f6e4c	09-Aug-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: make rib_add_redirect() use new nhop-based KPI MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D36169
# 88a782fc	15-Aug-2022	Mateusz Guzik <mjg@FreeBSD.org>	routing: G/C rt_exportinfo declaration Sponsored by: Rubicon Communications, LLC ("Netgate")
# 036f1bc6	14-Aug-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: retire rib_lookup_info() This function was added in pre-epoch era ( 9a1b64d5a0224 ) to provide public rtentry access interface & hide rtentry internals. The implementation is based on the large on-stack copying and refcounting of the referenced objects (ifa/ifp). It has become obsolete after epoch & nexthop introduction. Convert the last remaining user and remove the function itself. Differential Revision: https://reviews.freebsd.org/D36197
# 66230639	03-Aug-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: split nexthop creation and rtentry creation. This change is required for the upcoming introduction of the next nexthop-based operations KPI, as it will create rtentry and nexthops at different stages of route table modification. Differential Revision: https://reviews.freebsd.org/D36072 MFC after: 2 weeks
# 800c6846	28-Jul-2022	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: add nhop(9) kpi. Differential Revision: https://reviews.freebsd.org/D35985 MFC after: 1 month
# 4b631fc8	06-Sep-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: fix source address selection rules for IPv4 over IPv6. Current logic always selects an IFA of the same family from the outgoing interfaces. In IPv4 over IPv6 setup there can be just single non-127.0.0.1 ifa, attached to the loopback interface. Create a separate rt_getifa_family() to handle entire ifa selection for the IPv4 over IPv6. Differential Revision: https://reviews.freebsd.org/D31868 MFC after: 1 week
# d98954e2	29-Aug-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	routing: Bring back the ability to specify transmit interface via its name. Some software references outgoing interfaces by specifying name instead of index. Use rti_ifp from rt_addrinfo if provided instead of always using address interface when constructing nexthop. PR: 255678 Reported by: martin.larsson2 at gmail.com MFC after: 1 week
# a7581946	23-Jun-2021	Rozhuk Ivan <rozhuk.im@gmail.com>	devctl: add ADDR_ADD and ADDR_DEL devctl event for IFNET Add devd event on network iface address add/remove. Can be used to automate actions on any address change. Reviewed by: imp@ (and minor style tweaks) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30840
# 8e8f1cc9	23-Apr-2021	Mark Johnston <markj@FreeBSD.org>	Re-enable network ioctls in capability mode This reverts a portion of 274579831b61 ("capsicum: Limit socket operations in capability mode") as at least rtsol and dhcpcd rely on being able to configure network interfaces while in capability mode. Reported by: bapt, Greg V Sponsored by: The FreeBSD Foundation
# 27457983	07-Apr-2021	Mark Johnston <markj@FreeBSD.org>	capsicum: Limit socket operations in capability mode Capsicum did not prevent certain privileged networking operations, specifically creation of raw sockets and network configuration ioctls. However, these facilities can be used to circumvent some of the restrictions that capability mode is supposed to enforce. Add capability mode checks to disallow network configuration ioctls and creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET internet sockets. Reviewed by: oshogbo Discussed with: emaste Reported by: manu Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29423
# b1d63265	08-Mar-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Flush remaining routes from the routing table during VNET shutdown. Summary: This fixes rtentry leak for the cloned interfaces created inside the VNET. PR: 253998 Reported by: rashey at superbox.pl MFC after: 3 days Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown). Thus, any route table operations are too late to schedule. As the intent of the vnet teardown procedures is to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`. It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish. Test Plan: ``` set_skip:set_skip_group_lo -> passed [0.053s] tail -n 200 /var/log/messages | grep rtentry ``` Reviewers: #network, kp, bz Reviewed By: kp Subscribers: imp, ae Differential Revision: https://reviews.freebsd.org/D29116
# 59641728	22-Feb-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Simplify ifa/ifp refcounting in the routing stack. The routing stack control depends on quite a tree of functions to determine the proper attributes of a route such as a source address (ifa) or transmit ifp of a route. When actually inserting a route, the stack needs to ensure that ifa and ifp point to the entities that are still valid. Validity means slightly more than just pointer validity - the stack needs a guarantee that the provided objects are not scheduled for deletion. Currently, callers either ignore it (most ifp parts, historically) or try to use refcounting (ifa parts). Even in case of ifa refcounting it's not always implemented in a fully-safe manner. For example, some codepaths inside rt_getifa_fib() are referencing ifa while not holding any locks, resulting in possibility of referencing scheduled-for-deletion ifa. Instead of trying to fix all of the callers by enforcing proper refcounting, switch to a different model. As the rib_action() already requires epoch, do not require any stability guarantees other than the epoch-provided one. Use newly-added conditional versions of the refcounting functions (ifa_try_ref(), if_try_ref()) and fail if any of these fails. Reviewed by: donner MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D28837
# cb984c62	29-Jan-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix multipath support for rib_lookup_info(). The initial plan was to remove rib_lookup_info() before FreeBSD 13. As several customers are still remaining, fix rib_lookup_info() for the multipath use case.
# 81728a53	08-Jan-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Split rtinit() into multiple functions. rtinit[1]() is a function used to add or remove interface address prefix routes, similar to ifa_maintain_loopback_route(). It was intended to be family-agnostic. There is a problem with this approach in reality. 1) IPv6 code does not use it for the ifa routes. There is a separate layer, nd6_prelist_(), providing interface for maintaining interface routes. Its part, responsible for the actual route table interaction, mimics rtenty() code. 2) rtinit tries to combine multiple actions in the same function: constructing proper route attributes and handling iterations over multiple fibs, for the non-zero net.add_addr_allfibs use case. It notably increases the code complexity. 3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag for p2p connections, host routes and p2p routes are handled in the same way. Additionally, mapping IFA flags to RTF flags makes the interface pretty messy. It makes rtinit() clash with ifa_maintain_loopback_route() for IPv4 interface aliases. 4) rtinit() is the last customer passing non-masked prefixes to rib_action(), complicating rib_action() implementation. 5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive" ifa messages in certain corner cases. To address all these points, the following has been done: * rtinit() has been split into multiple functions: - Route attribute construction was moved to the per-address-family functions, dealing with (2), (3) and (4). - The function providing net.add_addr_allfibs handling and route rtsock notifications is the new routing table interface. - The rtsock ifa notification has been moved out as well. The resulting set of functions is only responsible for the actual route notifications. Side effects: * /32 alias does not result in interface routes (/32 route and "host" route) * RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses Differential revision: https://reviews.freebsd.org/D28186
# d68cf57b	07-Jan-2021	Alexander V. Chernikov <melifaro@FreeBSD.org>	Refactor rt_addrmsg() and rt_routemsg(). Summary: * Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally. * Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer. * Refactor rtsock_routemsg(): avoid accessing rtentry fields directly. * Simplify in_addprefix() by moving prefix search to a separate function. Reviewers: #network Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D28011
# f5baf8bb	25-Dec-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add modular fib lookup framework. This change introduces framework that allows to dynamically attach or detach longest prefix match (lpm) lookup algorithms to speed up datapath route tables lookups. Framework takes care of handling initial synchronisation, route subscription, nhop/nhop groups reference and indexing, dataplane attachments and fib instance algorithm setup/teardown. Framework features automatic algorithm selection, allowing for picking the best matching algorithm on-the-fly based on the amount of routes in the routing table. Currently framework code is guarded under FIB_ALGO config option. An idea is to enable it by default in the next couple of weeks. The following algorithms are provided by default: IPv4: * bsearch4 (lockless binary search in a special IP array), tailored for small-fib (<16 routes) * radix4_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix4 (base system radix backend) * dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastructure, optimized for large-fib (D27412) IPv6: * radix6_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix6 (base system radix backend) * dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastructure, optimized for large-fib (D27412) Performance changes: Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604): IPv4: 8 routes: radix4: ~20mpps radix4_lockless: ~24.8mpps bsearch4: ~69mpps dpdk_lpm4: ~67 mpps 700k routes: radix4_lockless: 3.3mpps dpdk_lpm4: 46mpps IPv6: 8 routes: radix6_lockless: ~20mpps dpdk_lpm6: ~70mpps 100k routes: radix6_lockless: 13.9mpps dpdk_lpm6: 57mpps Forwarding benchmarks: + 10-15% IPv4 forwarding performance (small-fib, bsearch4) + 25% IPv4 forwarding performance (full-view, dpdk_lpm4) + 20% IPv6 forwarding performance (full-view, dpdk_lpm6) Control: Framework adds the following runtime sysctls: List algos * net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4 * net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6 Debug level (7=LOG_DEBUG, per-route) net.route.algo.debug_level: 5 Algo selection (currently only for fib 0): net.route.algo.inet.algo: bsearch4 net.route.algo.inet6.algo: radix6_lockless Support for manually changing algos in non-default fib will be added soon. Some sysctl names will be changed in the near future. Differential Revision: https://reviews.freebsd.org/D27401
# d1d941c5	29-Nov-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove RADIX_MPATH config option. ROUTE_MPATH is the new config option controlling new multipath routing implementation. Remove the last pieces of RADIX_MPATH-related code and the config option. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D27244
# 7511a638	22-Nov-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Refactor rib iterator functions. * Make rib_walk() order of arguments consistent with the rest of RIB api * Add rib_walk_ext() allowing to exec callback before/after iteration. * Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del * Rename rt_foreach_fib_walk -> rib_foreach_table_walk * Move rib_foreach_table_walk{_del} to route/route_helpers.c * Slightly refactor rib_foreach_table_walk{_del} to make the implementation consistent and prepare for upcoming iterator optimizations. Differential Revision: https://reviews.freebsd.org/D27219
# bad6b236	08-Nov-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move all ifaddr route creation business logic to net/route/route_ifaddr.c Differential Revision: https://reviews.freebsd.org/D26318
# fedeb08b	03-Oct-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Introduce scalable route multipath. This change is based on the nexthop objects landed in D24232. The change introduces the concept of nexthop groups. Each group contains the collection of nexthops with their relative weights and a dataplane-optimized structure to enable efficient nexthop selection. Similar to the nexthops, nexthop groups are immutable. Dataplane part gets compiled during group creation and is basically an array of nexthop pointers, compiled w.r.t their weights. With this change, `rt_nhop` field of `struct rtentry` contains either nexthop or nexthop group. They are distinguished by the presence of NHF_MULTIPATH flag. All dataplane lookup functions return a pointer to the nexthop object, leaving nexthop group details inside routing subsystem. User-visible changes: The change is intended to be backward-compatible: all non-mpath operations should work as before with ROUTE_MPATH and net.route.multipath=1. All routes now come with weight, default weight is 1, maximum is 2^24-1. Current maximum multipath group width is statically set to 64. This will become sysctl-tunable in the followup changes. Using functionality: * Recompile kernel with ROUTE_MPATH * set net.route.multipath to 1 route add -6 2001:db8::/32 2001:db8::2 -weight 10 route add -6 2001:db8::/32 2001:db8::3 -weight 20 netstat -6On Nexthop groups data Internet6: GrpIdx NhIdx Weight Slots Gateway Netif Refcnt 1 ------- ------- ------- --------------------------------------- --------- 1 13 10 1 2001:db8::2 vlan2 14 20 2 2001:db8::3 vlan2 Next steps: * Land outbound hashing for locally-originated routes ( D26523 ). * Fix net/bird multipath (net/frr seems to work fine) * Add ROUTE_MPATH to GENERIC * Set net.route.multipath=1 by default Tested by: olivier Reviewed by: glebius Relnotes: yes Differential Revision: https://reviews.freebsd.org/D26449
# 2259a030	21-Sep-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Rework part of routing code to reduce difference to D26449. * Split rt_setmetrics into get_info_weight() and rt_set_expire_info(), as these two can be applied at different entities and at different times. * Start filling route weight in route change notifications * Pass flowid to UDP/raw IP route lookups * Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact that rtentry can contain multiple nexthops. Differential Revision: https://reviews.freebsd.org/D26497
# 05aca418	07-Sep-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Consistently use the same gateway when adding/deleting interface routes. Use the same link-level gateway when adding or deleting interface routes. This helps nexthop checking in the upcoming multipath changes. Differential Revision: https://reviews.freebsd.org/D26317
# 662c1305	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	net: clean up empty lines in .c and .h files
# a624ca3d	28-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move net/route/shared.h definitions to net/route/route_var.h. No functional changes. net/route/shared.h was created in the initial phases of nexthop conversion. It was intended to serve the same purpose as route_var.h - share definitions of functions and structures between the routing subsystem components. At that time route_var.h was included by many files external to the routing subsystem, which largely defeats its purpose. As currently this is not the case anymore and amount of route_var.h includes is roughly the same as shared.h, retire the latter in favour of the former.
# 592d300e	24-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove RT_LOCK mutex from rte. rtentry lock traditionally served 2 purposes: first was protecting refcounts, the second was assuring consistent field access/changes. Since route nexthop introduction, the need for the former disappeared and the need for the latter reduced. To be more precise, the following rte fields are mutable: rt_nhop (nexthop pointer, updated with RIB_WLOCK, passed in rib_cmd_info) rte_flags (only RTF_HOST and RTF_UP, where RTF_UP gets changed at rte removal) rt_weight (relative weight, updated with RIB_WLOCK, passed in rib_cmd_info) rt_expire (time when rte deletion is scheduled, updated with RIB_WLOCK) rt_chain (deletion chain pointer, updated with RIB_WLOCK) All of them are updated under RIB_WLOCK, so the only remaining concern is the reading. rt_nhop and rt_weight (addressed in this review) are read under rib lock and stored in the rib_cmd_info, so the caller has no problem with consistency. rte_flags is currently read unlocked in rtsock reporting (however the scope is only RTF_UP flag, which is pretty static). rt_expire is currently read unlocked in rtsock reporting. rt_chain accesses are safe, as this is only used at route deletion. rt_expire and rte_flags reads will be dealt with in separate reviews soon. Differential Revision: https://reviews.freebsd.org/D26162
# eb1c7adb	22-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Finish r364492 by renaming rt_flags to rte_flags for multipath code.
# 93bfd365	22-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Rename rt_flags to rte_flags && reduce number of rt_nhop accesses. No functional changes. Most of the routing flags are stored in the nexthop instead of rtentry. Rename rt->rt_flags to rt->rte_flags to simplify reading/modifying code checking routing flags. In the new multipath code, rt->rt_nhop may actually point to nexthop group instead of nhop. To ease transition, reduce the amount of rt->rt_nhop->... accesses. Differential Revision: https://reviews.freebsd.org/D26156
# f5247a23	21-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Make net.fibs growable. Allow to dynamically grow the amount of fibs in each vnet. This change alters current behavior. Currently, if one defines ROUTETABLES > 1 in the kernel config, each vnet will be created with the number of fibs defined in the kernel config. After this commit vnets will be created with fibs=1. Dynamic net.fibs is not compatible with net.add_addr_allfibs. The plan is to deprecate the latter and make net.add_addr_allfibs=0 default behaviour. Reviewed by: glebius Relnotes: yes Differential Revision: https://reviews.freebsd.org/D26062
# 2f23f45b	14-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Simplify dom_<rtattach|rtdetach>. Remove unused arguments from dom_rtattach/dom_rtdetach functions and make them return/accept 'struct rib_head' instead of 'void **'. Declare inet/inet6 implementations in the relevant _var.h headers similar to domifattach / domifdetach. Add rib_subscribe_internal() function to accept subscriptions to the rnh directly. Differential Revision: https://reviews.freebsd.org/D26053
# 6cbadc42	13-Aug-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move rtzone handling code to net/route_ctl.c After moving the route control plane code from net/route.c, all rtzone users ended up being in net/route_ctl.c. Move uma(9) rtzone setup/teardown code to net/route_ctl.c as well to have everything in a single place. While here, remove custom initializers from the zone. They were added originally to avoid setup/teardown of costly per-cpu counters. With these counters removed, the only remaining job was avoiding rte mutex setup/teardown. Mutex setup is relatively cheap. Additionally, this mutex will soon be removed. With that in mind, there is no sense in keeping custom zone callbacks. Differential Revision: https://reviews.freebsd.org/D26051
# f7d79f6c	12-Aug-2020	Mitchell Horne <mhorne@FreeBSD.org>	Correctly set error in rt_mpath_unlink It is possible for rn_delete() to return NULL. If this happens, then set *perror to ESRCH, as is done in the rest of the function. Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D25871
# e1c05fd2	21-Jul-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Transition from rtrequest1_fib() to rib_action(). Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib, in6_rtrequest, rtrequest_fib> and their uses and switch to rib_action(). This is part of the new routing KPI. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25546
# 72587123	19-Jul-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Temporarily revert r363319 to unbreak the build. Reported by: CI Pointy hat to: melifaro
# 8cee15d9	19-Jul-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Transition from rtrequest1_fib() to rib_action(). Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib, in6_rtrequest, rtrequest_fib> and their uses and switch to rib_action(). This is part of the new routing KPI. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25546
# edc37a66	12-Jul-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add destructor for the rib subscription system to simplify users code. Subscriptions are planned to be used by modules such as route lookup engines. In that case that's the module task to properly unsubscribe before detach. However, the in-kernel customer - inet6 wants to track default route changes. To avoid having inet6 store per-fib subscriptions, handle automatic destruction internally. Differential Revision: https://reviews.freebsd.org/D25614
# da187ddb	01-Jun-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	* Add rib_<add|del|change>_route() functions to manipulate the routing table. The main driver for the change is the need to improve notification mechanism. Currently callers guess the operation data based on the rtentry structure returned in case of successful operation result. There are two problems with this approach. First is that it doesn't provide enough information for the upcoming multipath changes, where rtentry refers to a new nexthop group, and there is no way of guessing which paths were added during the change. Second is that some rtentry fields can change during notification and protecting from it by requiring customers to unlock rtentry is not desired. Additionally, as the consumers such as rtsock do know which operation they request in advance, making explicit add/change/del versions of the functions makes sense, especially given the functions don't share a lot of code. With that in mind, introduce rib_cmd_info notification structure and rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer. It will be used in upcoming generalized notifications. * Move definitions of the new functions and some other functions/structures used for the routing table manipulation to a separate header file, net/route/route_ctl.h. net/route.h is a frequently used file included in ~140 places in kernel, and 90% of the users don't need these definitions. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D25067
# e7403d02	01-Jun-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Revert r361704, it accidentally committed merged D25067 and D25070.
# 79674562	01-Jun-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	* Add rib_<add|del|change>_route() functions to manipulate the routing table. The main driver for the change is the need to improve notification mechanism. Currently callers guess the operation data based on the rtentry structure returned in case of successful operation result. There are two problems with this approach. First is that it doesn't provide enough information for the upcoming multipath changes, where rtentry refers to a new nexthop group, and there is no way of guessing which paths were added during the change. Second is that some rtentry fields can change during notification and protecting from it by requiring customers to unlock rtentry is not desired. Additionally, as the consumers such as rtsock do know which operation they request in advance, making explicit add/change/del versions of the functions makes sense, especially given the functions don't share a lot of code. With that in mind, introduce rib_cmd_info notification structure and rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer. It will be used in upcoming generalized notifications. * Move definitions of the new functions and some other functions/structures used for the routing table manipulation to a separate header file, net/route/route_ctl.h. net/route.h is a frequently used file included in ~140 places in kernel, and 90% of the users don't need these definitions. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D25067
# cb86ca48	28-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Unlock rtentry before calling for epoch(9) destruction as the destruction may happen immediately, leading to panic. Reported by: bdragon
# 4d2c2509	23-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move <add|del|change>_route() functions to route_ctl.c in preparation of multipath control plane changes described in D24141. Currently route.c contains core routing init/teardown functions, route table manipulation functions and various helper functions, resulting in >2KLOC file in total. This change moves most of the route table manipulation parts to a dedicated file, simplifying planned multipath changes and making route.c more manageable. Differential Revision: https://reviews.freebsd.org/D24870
# a82f62ec	22-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove refcounting from rtentry. After making rtentry reclamation backed by epoch(9) in r361409, there is no reason in keeping reference counting code. Differential Revision: https://reviews.freebsd.org/D24867
# 2bbab0af	23-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Use epoch(9) for rtentries to simplify control plane operations. Currently the only reason of refcounting rtentries is the need to report the rtable operation details immediately after the execution. Delaying rtentry reclamation allows to stop refcounting and simplify the code. Additionally, this change allows to reimplement rib_lookup_info(), which is used by some of the customers to get the matching prefix along with nexthops, in more efficient way. The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to nhop_priv to be able to reliably set curvnet even during vnet teardown. Rest of the reference counting code will be removed in the D24867 . Differential Revision: https://reviews.freebsd.org/D24866
# 4a6ee281	11-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove unused rnh_close callback from rtable & cleanup depends. The rnh_close callback was used by the in[6]_clsroute() handlers, doing cleanup in the route cloning code. Route cloning was eliminated somewhere around r186119. Last callback user was eliminated in r186215, 11 years ago. Differential Revision: https://reviews.freebsd.org/D24793
# d2233725	10-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove rtalloc1(_fib) KPI. Last user of rtalloc1() KPI has been eliminated in rS360631. As kernel is now fully switched to use new routing KPI defined in rS359823, remove old lookup functions. Differential Revision: https://reviews.freebsd.org/D24776
# 656442a7	08-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Embed dst sockaddr into rtentry and remove rte packet counter Currently each rtentry has dst&gateway allocated separately from another zone, bloating cache accesses. Current 'struct rtentry' has 12 "mandatory" radix pointers in the beginning, leaving 4 usable pointers/32 bytes in the first 2 cache lines (amd64). Fields needed for the datapath are destination sockaddr and rt_nhop. So far it doesn't look like there is other routable addressing protocol other than IPv4/IPv6/MPLS, which uses keys longer than 20 bytes. With that in mind, embed dst into struct rtentry, making the first 24 bytes of rtentry within 128 bytes. That is enough to make IPv6 address within first 128 bytes. It is still pretty easy to add code for supporting separately-allocated dst, however it doesn't make a lot of sense in having such code without a use case. As rS359823 moved the gateway to the nexthop structure, the dst embedding change removes the need for any additional allocations done by rt_setgate(). Lastly, as a part of cleanup, remove counter(9) allocation code, as this field is not used in packet processing anymore. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24669
# 682b902d	07-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add rib_lookup() sockaddr lookup wrapper and make ifa_ifwithroute use it. Create rib_lookup() wrapper around per-af dataplane lookup functions. This will help in the cases of having control plane af-agnostic code. Switch ifa_ifwithroute() to use this function instead of rtalloc1(). Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24731
# 9e022295	04-May-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields. After converting routing subsystem customers to use nexthop objects defined in r359823, some fields in struct rtentry became unused. This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry along with the code initializing and updating these fields. Cleanup of the remaining fields will be addressed by D24669. This commit also changes the implementation of the RTM_CHANGE handling. Old implementation tried to perform the whole operation under radix WLOCK, resulting in slow performance and hacks like using RTF_RNH_LOCKED flag. New implementation looks up the route nexthop under radix RLOCK, creates new nexthop and tries to update rte nhop pointer. Only the last part is done under WLOCK. In the hypothetical scenarios where multiple rtsock clients repeatedly issue RTM_CHANGE requests for the same route, the route may get updated between the read and update operations. This is addressed by retrying the operation multiple (3) times before returning failure back to the caller. Differential Revision: https://reviews.freebsd.org/D24666
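The read-then-publish scheme with a bounded retry described in 9e022295 can be sketched in userland C. This is only an illustration under stated assumptions, not the kernel code: `struct nhop`, `struct route`, `try_swap()`, `change_mtu()` and the `race_*` hooks are hypothetical names, and the locks are represented by comments only. The retry count of 3 is the one quoted in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel objects. */
struct nhop { int mtu; };
struct route { const struct nhop *nh; };

#define CHANGE_RETRIES 3	/* retry count quoted in the commit message */

/*
 * Publish 'replacement' only if the route still points at 'expected'.
 * In the kernel this step would run under the radix WLOCK.
 */
static int
try_swap(struct route *rt, const struct nhop *expected,
    const struct nhop *replacement)
{
	if (rt->nh != expected)
		return (0);		/* lost the race */
	rt->nh = replacement;
	return (1);
}

/*
 * Read the current nexthop (RLOCK phase), build a modified copy in
 * 'scratch' without holding a lock, then try to publish it.
 * 'interfere' is a test hook simulating a concurrent RTM_CHANGE between
 * the read and the write; it may be NULL.
 */
static int
change_mtu(struct route *rt, struct nhop *scratch, int mtu,
    void (*interfere)(struct route *, int))
{
	assert(rt != NULL && scratch != NULL);
	for (int attempt = 0; attempt < CHANGE_RETRIES; attempt++) {
		const struct nhop *old = rt->nh;	/* read phase */
		*scratch = *old;			/* build new nexthop */
		scratch->mtu = mtu;
		if (interfere != NULL)
			interfere(rt, attempt);
		if (try_swap(rt, old, scratch))		/* write phase */
			return (0);
	}
	return (-1);	/* report failure after CHANGE_RETRIES lost races */
}

/* Test hooks: two spare nexthops that a "concurrent" writer installs. */
static struct nhop alt_a = { 1500 }, alt_b = { 1500 };

static void
race_once(struct route *rt, int attempt)
{
	if (attempt == 0)
		rt->nh = &alt_a;
}

static void
race_always(struct route *rt, int attempt)
{
	(void)attempt;
	rt->nh = (rt->nh == &alt_a) ? &alt_b : &alt_a;
}
```

With `race_once` the first attempt fails and the second succeeds; with `race_always` every attempt loses and the caller gets the error, which mirrors the "retry, then fail" behaviour the commit describes.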
# 8c61eb21	29-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Convert more rtentry field accesses into nhop fields accesses. Continue routing subsystem conversion to nhop objects defined in r359823. Use fields from nhop structure instead of "struct rtentry" fields. This is one of the last changes prior to removing rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry. Differential Revision: https://reviews.freebsd.org/D24609
# 74787ef4	29-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add nhop to the ifa_rtrequest() callback. With the upcoming multipath changes described in D24141, rt->rt_nhop can potentially point to a nexthop group instead of an individual nhop. To simplify caller handling of such cases, change ifa_rtrequest() callback to pass changed nhop directly. Differential Revision: https://reviews.freebsd.org/D24604
# e7d8af4f	28-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move route_temporal.c and route_var.h to net/route. Nexthop objects implementation, defined in r359823, introduced sys/net/route directory intended to hold all routing-related code. Move recently-introduced route_temporal.c and private route_var.h header there. Differential Revision: https://reviews.freebsd.org/D24597
# 1b0051ba	28-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Eliminate now-unused parts of old routing KPI. r360292 switched most of the remaining routing customers to a new KPI, leaving a bunch of wrappers for old routing lookup functions unused. Remove them from the tree as a part of routing cleanup. Differential Revision: https://reviews.freebsd.org/D24569
# 983066f0	25-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Convert route caching to nexthop caching. This change is built on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with other parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nexthop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remain the same. Differential Revision: https://reviews.freebsd.org/D24340
# aaad3c4f	23-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Convert rtentry field accesses into nhop field accesses. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver more features much faster. This commit is mostly mechanical change to eliminate direct struct rtentry field accesses. The only notable difference is AF_LINK gateway encoding. AF_LINK gw is used in routing stack for operations with interface routes and host loopback routes. In the former case it indicates _some_ non-NULL gateway, as the interface is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting. In the latter case the interface index inside gateway was used by the IPv6 datapath to verify address scope for link-local interfaces. Kernel uses struct sockaddr_dl for this type of gateway. This structure allows for specifying rich interface data, such as mac address and interface name. However, this results in relatively large structure size - 52 bytes. Routing stack fills in only 2 fields - sdl_index and sdl_type, which reside in the first 8 bytes of the structure. In the new KPI, struct nhop_object tries to be cache-efficient, hence embodies gateway address inside the structure. In the AF_LINK case it stores a shortened version of the structure - struct sockaddr_dl_short, which occupies 16 bytes. After D24340 changes, the data inside AF_LINK gateway will not be used in the kernel at all, leaving rtsock as the only potential concern.
The difference in rtsock reporting: (old) got message of size 240 on Thu Apr 16 03:12:13 2020 RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 (new) got message of size 200 on Sun Apr 19 09:46:32 2020 RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 Note the 40-byte difference (52-16 + alignment). However, gateway is still a valid AF_LINK gateway with proper data filled in. It is worth noting that these particular messages (interface routes) are mostly ignored by routing daemons: * bird/quagga/frr use RTM_NEWADDR and ignore prefix route addition messages. * quagga/frr ignore routes without gateway. A more detailed overview of how rtsock messages are used by the routing daemons to reconstruct the kernel view can be found in D22974. Differential Revision: https://reviews.freebsd.org/D24519
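The "52-16 + alignment" arithmetic in aaad3c4f can be checked with a few lines of C. This is a sketch under assumptions: the two structure sizes are the ones quoted in the commit message, the 8-byte padding of each sockaddr in the rtsock message is assumed for amd64, and `roundup_pow2()` and `rtsock_gw_saving()` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to a multiple of align; align must be a power of two. */
static size_t
roundup_pow2(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((len + align - 1) & ~(align - 1));
}

/*
 * Sizes as quoted in the commit message. RTSOCK_ALIGN is an assumption:
 * sockaddrs in a routing message padded to 8 bytes on amd64.
 */
enum { SDL_FULL = 52, SDL_SHORT = 16, RTSOCK_ALIGN = 8 };

/* Bytes saved per message by reporting the short AF_LINK gateway. */
static size_t
rtsock_gw_saving(void)
{
	return (roundup_pow2(SDL_FULL, RTSOCK_ALIGN) -
	    roundup_pow2(SDL_SHORT, RTSOCK_ALIGN));
}
```

Under these assumptions the padded sizes are 56 and 16 bytes, so the saving is 40 bytes, matching the drop from a 240-byte to a 200-byte RTM_ADD message shown in the log.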
# 539642a2	16-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add nhop parameter to rti_filter callback. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver more features much faster. This change is one of the ongoing changes to eliminate direct struct rtentry field accesses. Additionally, with the followup multipath changes, single rtentry can point to multiple nexthops. With that in mind, convert rti_filter callback used when traversing the routing table to accept pair (rt, nhop) instead of nexthop. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24440
# dd4776f0	14-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Reorganise nd6 notification code to avoid direct rtentry field access. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. Doing so will allow to improve routing subsystem internals and deliver features more easily. This change is one of the ongoing changes to eliminate direct struct rtentry field accesses. It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification code to avoid accessing most of the rtentry fields. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24404
# a6663252	12-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Introduce nexthop objects and new routing KPI. This is the foundational change for the routing subsystem rearchitecture. More details and goals are available in https://reviews.freebsd.org/D24141 . This patch introduces concept of nexthop objects and new nexthop-based routing KPI. Nexthops are objects, containing all necessary information for performing the packet output decision. Output interface, mtu, flags, gw address goes there. For most of the cases, these objects will serve the same role as the struct rtentry is currently serving. Typically there will be low tens of such objects for the router even with multiple BGP full-views, as these objects will be shared between routing entries. This allows to store more information in the nexthop. New KPI: struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, uint32_t flowid); struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid, uint32_t flags, uint32_t flowid); These 2 functions are intended to replace all flavours of <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous fib[46]-generation functions. Upon successful lookup, they return nexthop object which is guaranteed to exist within current NET_EPOCH. If longer lifetime is desired, one can specify NHR_REF as a flag and get a referenced version of the nexthop. Reference semantic closely resembles rtentry one, allowing sed-style conversion. Additionally, another 2 functions are introduced to support uRPF functionality inside variety of our firewalls.
Their primary goal is to hide the multipath implementation details inside the routing subsystem, greatly simplifying firewalls implementation: int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, const struct ifnet *src_if); int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid, uint32_t flags, const struct ifnet *src_if); All functions have a separate scopeid argument, paving way to eliminating IPv6 scope embedding and allowing to support IPv4 link-locals in the future. Structure changes: * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size. * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz. Old KPI: During the transition state old and new KPI will coexist. As there are another 4-5 decent-sized conversion patches, it will probably take a couple of weeks. To support both KPIs, fields not required by the new KPI (most of rtentry) have to be kept, resulting in the temporary size increase. Once conversion is finished, rtentry will notably shrink. More details: * architectural overview: https://reviews.freebsd.org/D24141 * list of the next changes: https://reviews.freebsd.org/D24232 Reviewed by: ae,glebius(initial version) Differential Revision: https://reviews.freebsd.org/D24232
# aef2d5fb	10-Apr-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Split rtrequest1_fib() into smaller manageable chunks. No functional changes. * Move route addition / route deletion code from rtrequest1_fib() to add_route() and del_route() respectively. * Rename rtrequest1_fib_change() to change_route() for consistency. * Shrink the scope of ugly info #defines. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D24349
# ea277332	03-Mar-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix dynamic redirects by adding forgotten RTF_HOST flag. Improve tests to verify the generated route flags. Reported by: jtl MFC after: 2 weeks
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# 34a5582c	22-Jan-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Bring back redirect route expiration. Redirect (and temporal) route expiration was broken a while ago. This change brings route expiration back, with unified IPv4/IPv6 handling code. It introduces net.inet.icmp.redirtimeout sysctl, allowing to set an expiration time for redirected routes. It defaults to 10 minutes, analogous to net.inet6.icmp6.redirtimeout. Implementation uses separate file, route_temporal.c, as route.c is already bloated with tons of different functions. Internally, expiration is implemented as a per-rnh callout scheduled when route with non-zero rt_expire time is added or rt_expire is changed. It does not add any overhead when no temporal routes are present. Callout traverses entire routing tree under wlock, scheduling expired routes for deletion and calculating the next time it needs to be run. The rationale for such implementation is the following: typically workloads requiring large amount of routes have redirects turned off already, while the systems with small amount of routes will not exhibit large overhead during tree traversal. This change also fixes netstat -rn display of route expiration time, which has been broken since the conversion from kread() to sysctl. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23075
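The callout pass described in 34a5582c (walk every route, schedule expired ones for deletion, compute when to run next) can be sketched over a flat array. This is an illustration only: `struct troute` and `expire_pass()` are hypothetical names, the array stands in for the radix tree, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route record; expire == 0 means "never expires". */
struct troute {
	int64_t	expire;		/* absolute expiration time, 0 if permanent */
	int	deleted;	/* set once the route is scheduled for deletion */
};

/*
 * One pass of the expiration callout. Marks every temporal route whose
 * time has come as deleted and returns the earliest expiration time
 * among the surviving temporal routes. A return value of 0 means no
 * temporal routes remain, so the callout does not need rescheduling.
 */
static int64_t
expire_pass(struct troute *tbl, size_t n, int64_t now)
{
	int64_t next = 0;

	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct troute *rt = &tbl[i];

		if (rt->deleted || rt->expire == 0)
			continue;		/* gone already, or permanent */
		if (rt->expire <= now) {
			rt->deleted = 1;	/* expired: schedule deletion */
			continue;
		}
		if (next == 0 || rt->expire < next)
			next = rt->expire;	/* earliest future expiry */
	}
	return (next);
}
```

The cost is one full traversal per pass, which is the trade-off the commit message argues for: tables with many routes usually have redirects disabled, and small tables are cheap to walk.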
# 97168be8	14-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanically substitute assertion of in_epoch(net_epoch_preempt) to NET_EPOCH_ASSERT(). NFC
# ead85fe4	09-Jan-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add fibnum, family and vnet pointer to each rib head. Having metadata such as fibnum or vnet in the struct rib_head is handy as it eases building functionality in the routing space. This change is required to properly bring back route redirect support. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23047
# e02d3fe7	07-Jan-2020	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix rtsock route message generation for interface addresses. Reviewed by: olivier MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D22974
# 6b5d8e30	26-Dec-2019	Mark Johnston <markj@FreeBSD.org>	Plug some ifaddr refcount leaks. - Only take an ifaddr ref in in rt_exportinfo() if the caller explicitly requests it. Take care to release it in this case. - Don't unconditionally take a ref in rtrequest1_fib(). rt_getifa_fib() will acquire a reference, in which case we would previously acquire two references. - Stop taking a reference in rtinit1() before calling rtrequest1_fib(). rtrequest1_fib() will acquire a reference for the RTM_ADD case. PR: 242746 Reviewed by: melifaro (previous version) Tested by: ghuckriede@blackberry.com MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22912
# 185c3d2b	16-Dec-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Convert routing statistics to VNET_PCPUSTAT. Submitted by: ocochard Reviewed by: melifaro, glebius Differential Revision: https://reviews.freebsd.org/D22834
# fda45409	18-Oct-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Make rt_getifa_fib() static.
# 270b83b9	14-Oct-2019	Hans Petter Selasky <hselasky@FreeBSD.org>	The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the same after the network stack was epochified. Merge the two into one function and cleanup all uses of ifnet_byindex_locked(). While at it: - Add branch prediction macros. - Make sure the ifnet pointer is only dereferenced once, also when code optimisation is disabled. Sponsored by: Mellanox Technologies
# 69104ebe	13-Oct-2019	Michael Tuexen <tuexen@FreeBSD.org>	Add missing include which breaks builds without VIMAGE. The bug was introduced by me in r353480. Reported by: Michael Butler MFC after: 3 days
# d6e23cf0	13-Oct-2019	Michael Tuexen <tuexen@FreeBSD.org>	Use an event handler to notify the SCTP about IP address changes instead of calling an SCTP specific function from the IP code. This is a requirement of supporting SCTP as a kernel loadable module. This patch was developed by markj@, I tweaked a bit the SCTP related code. Submitted by: markj@ MFC after: 3 days
# b8a6e03f	07-Oct-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111
# d8dc4e35	03-Aug-2019	George V. Neville-Neil <gnn@FreeBSD.org>	Properly validate arguments for route deletion Reported by: Liang Zhuo brightiup.zhuo@gmail.com MFC after: 1 week
# 563ab4e4	22-May-2019	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix gateway setup for the interface routes. Currently rtinit1() and its IPv6 counterpart nd6_prefix_onlink_rtrequest() use a dummy null_sdl gateway address during route insertion and change it afterwards. This behaviour brings complications to the routing stack and the users of its upcoming notification system. This change fixes both rtinit1() and nd6_prefix_onlink_rtrequest() by filling in proper gateway in the beginning. It does not change any of the userland notifications as in both cases, they happen after the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20328
# 2ad7ed6e	19-May-2019	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix rt_ifa selection during loopback route insertion process. Currently such routes are added with a link-level IFA, which is plain wrong. Only after the insertion they get fixed by the special link_rtrequest() ifa handler. This behaviour complicates routing code and makes ifa selection more complex. Streamline this process by explicitly moving link_rtrequest() logic to the pre-insertion rt_getifa_fib() ifa selector. Avoid calling all this logic in the loopback route case by explicitly specifying proper rt_ifa inside the ifa_maintain_loopback_route(). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20076
# a68cc388	08-Jan-2019	Gleb Smirnoff <glebius@FreeBSD.org>	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros are quite unsafe, e.g. will produce buggy code if the same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin
# 5f901c92	24-Jul-2018	Andrew Turner <andrew@FreeBSD.org>	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147
# 6573d758	03-Jul-2018	Matt Macy <mmacy@FreeBSD.org>	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066
# 20efcfc6	16-Jun-2018	Andrey V. Elsukov <ae@FreeBSD.org>	Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9). Using rwlock with multiqueue NICs for IP forwarding at high pps produces high lock contention and is inefficient. Rmlock fits better for such workloads. Reviewed by: melifaro, olivier Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15789
# b8af2820	07-Jun-2018	Mateusz Guzik <mjg@FreeBSD.org>	uma: fix up r334824 Turns out there is code which ends up passing M_ZERO to counters. Since counters zero unconditionally on their own, just drop the flag in that place.
# 58378a89	07-Jun-2018	Matt Macy <mmacy@FreeBSD.org>	rtentry_zinit: don't blindly pass through M_ZERO to counter alloc
# 134804c8	29-May-2018	Matt Macy <mmacy@FreeBSD.org>	rt_getifa_fib: don't use ifa but info->rti_ifa Reported by: kp
# 1ebec5fa	28-May-2018	Matt Macy <mmacy@FreeBSD.org>	route: fix missed ref adds - ensure that we bump the ifa ref whenever we add a reference - defer freeing epoch protected references until after the if_purgaddrs loop
# 9379029a	25-May-2018	Matt Macy <mmacy@FreeBSD.org>	rtrequest1_fib: we need to always bump the ifaddr refcount when we take a reference from an rtentry. r334118 introduced a case when this was not done. While we're here make the intent more obvious by moving the refcount bump down to when we know we'll actually need it. Reported by: markj
# 4f6c66cc	23-May-2018	Matt Macy <mmacy@FreeBSD.org>	UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409
# 891cf3ed	18-May-2018	Ed Maste <emaste@FreeBSD.org>	Use NULL for SYSINIT's last arg, which is a pointer type Sponsored by: The FreeBSD Foundation
# bc3d87fd	22-Jan-2018	Ryan Stone <rstone@FreeBSD.org>	Increment the route table gen count after a modify Increment the route table generation count after modifying a route. This signals back to TCP connections that they need to update their L2 caches as the gateway for their route may have changed. This is a heavier hammer than is needed, strictly speaking, but route changes will be unlikely enough that the performance effects of invalidating all connection route caches should be negligible. MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13990 Reviewed by: karels
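The generation-count invalidation that bc3d87fd relies on can be sketched as follows. This is a userland illustration with hypothetical names (`struct rtable`, `struct rcache`, `rtable_change_gw()`, `cache_lookup()`); the real kernel structures and locking differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routing table with a generation counter. */
struct rtable {
	unsigned	gen;	/* bumped on every modification */
	int		gw;	/* stand-in for the route's gateway */
};

/* Hypothetical per-connection cache of a lookup result. */
struct rcache {
	unsigned	gen;	/* table generation the entry was filled at */
	int		gw;
	int		valid;
};

/* Modify the route and signal all caches by bumping the generation. */
static void
rtable_change_gw(struct rtable *t, int gw)
{
	assert(t != NULL);
	t->gw = gw;
	t->gen++;
}

/*
 * Return the cached gateway, refilling the cache when it is empty or
 * was filled under an older table generation. *missed reports whether
 * a refill (the "slow" lookup) was needed.
 */
static int
cache_lookup(struct rcache *c, const struct rtable *t, int *missed)
{
	assert(c != NULL && t != NULL && missed != NULL);
	if (!c->valid || c->gen != t->gen) {
		c->gw = t->gw;
		c->gen = t->gen;
		c->valid = 1;
		*missed = 1;
	} else {
		*missed = 0;
	}
	return (c->gw);
}
```

A single counter invalidates every cache at once, including those whose route did not change. That is the "heavier hammer" the commit message accepts, on the grounds that route changes are rare.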
# 19f41c2a	14-Dec-2017	Ryan Stone <rstone@FreeBSD.org>	Plug an ifaddr leak when changing a route's src If a route is modified in a way that changes the route's source address (i.e. the address used to access the gateway), then a reference on the ifaddr representing the old source address will be leaked if the address type does not have an ifa_rtrequest method defined. Plug the leak by releasing the reference in all cases. Differential Revision: https://reviews.freebsd.org/D13417 Reviewed by: ae MFC after: 3 weeks Sponsored by: Dell
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# ae69ad88	27-Jul-2017	Bjoern A. Zeeb <bz@FreeBSD.org>	After inpcb route caching was put back in place there is no need for flowtable anymore (as flowtable was never considered to be useful in the forwarding path). Reviewed by: np Differential Revision: https://reviews.freebsd.org/D11448
# b83aa367	13-Jun-2017	Andrey V. Elsukov <ae@FreeBSD.org>	Resurrect RTF_RNH_LOCKED flag and restore ability to call rtalloc1_fib() with acquired RIB lock. This fixes a possible panic due to trying to acquire RIB rlock when it is already exclusive locked. PR: 215963, 215122 MFC after: 1 week Sponsored by: Yandex LLC
# fbbd9655	28-Feb-2017	Warner Losh <imp@FreeBSD.org>	Renumber copyright clause 4 Renumber clause 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistence on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96
# 8f1c8ade	08-Dec-2016	Luiz Otavio O Souza <loos@FreeBSD.org>	Fix the typos and style(9) in comment. MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC (Netgate)
# abe95d87	06-Oct-2016	Andrey V. Elsukov <ae@FreeBSD.org>	Replace rw_init/rw_destroy with corresponding macros. Obtained from: Yandex LLC
# 89856f7e	21-Jun-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Get closer to a VIMAGE network stack teardown from top to bottom rather than removing the network interfaces first. This change is rather larger and convoluted as the ordering requirements cannot be separated. Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and related modules to their own SI_SUB_PROTO_FIREWALL. Move initialization of "physical" interfaces to SI_SUB_DRIVERS, move virtual (cloned) interfaces to SI_SUB_PSEUDO. Move Multicast to SI_SUB_PROTO_MC. Re-work parts of multicast initialisation and teardown, not taking the huge amount of memory into account if used as a module yet. For interface teardown we try to do as many of them as we can on SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling over a higher layer protocol such as IP. In that case the interface has to go along (or before) the higher layer protocol is shutdown. Kernel hhooks need to go last on teardown as they may be used at various higher layers and we cannot remove them before we cleaned up the higher layers. For interface teardown there are multiple paths: (a) a cloned interface is destroyed (inside a VIMAGE or in the base system), (b) any interface is moved from a virtual network stack to a different network stack ("vmove"), or (c) a virtual network stack is being shut down. All code paths go through if_detach_internal() where we, depending on the vmove flag or the vnet state, make a decision on how much to shut down; in case we are destroying a VNET the individual protocol layers will cleanup their own parts thus we cannot do so again for each interface as we end up with, e.g., double-frees, destroying locks twice or acquiring already destroyed locks. When calling into protocol cleanups we equally have to tell them whether they need to detach upper layer protocols ("ulp") or not (e.g., in6_ifdetach()). 
Provide or enhance helper functions to do proper cleanup at a protocol rather than at an interface level. Approved by: re (hrs) Obtained from: projects/vnet Reviewed by: gnn, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D6747
# 80ae8d60	05-Jun-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Provide a public interface to rt_flushifroutes which takes the address family as an argument as well. This will be used to cleanup individual protocols during VNET teardown. Obtained from: projects/vnet Sponsored by: The FreeBSD Foundation
# 6d768226	02-Jun-2016	George V. Neville-Neil <gnn@FreeBSD.org>	This change re-adds L2 caching for TCP and UDP, as originally added in D4306 but removed due to other changes in the system. Restore the llentry pointer to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as appropriate. Submitted by: Mike Karels Differential Revision: https://reviews.freebsd.org/D6262
# 4f321dbd	24-Mar-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix compile errors after r297225: - properly V_irtualise variable access unbreaking VIMAGE kernels. - remove the volatile from the function return type to make architecture using gcc happy [-Wreturn-type] "type qualifiers ignored on function return type" I am not entirely happy with this solution putting the u_int there but it will do for now.
# 84cc0778	24-Mar-2016	George V. Neville-Neil <gnn@FreeBSD.org>	FreeBSD previously provided route caching for TCP (and UDP). Re-add route caching for TCP, with some improvements. In particular, invalidate the route cache if a new route is added, which might be a better match. The cache is automatically invalidated if the old route is deleted. Submitted by: Mike Karels Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4306
# a5243af2	03-Feb-2016	Bjoern A. Zeeb <bz@FreeBSD.org>	Code duplication but rib_head is special. Not found an easy way to go back and harmonize the use cases among RIB, IPFW, PF yet but it's also not the scope of this work. Prevents instant panics on teardown and frees the FIB bits again. Sponsored by: The FreeBSD Foundation
# 94017572	25-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix flowtable part missed in r294706.
# 61eee0e2	24-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	MFP r287070,r287073: split radix implementation and route table structure. There are a number of radix consumers in kernel land (pf,ipfw,nfs,route) with different requirements. In fact, first 3 don't have _any_ requirements and first 2 do not use radix locking. On the other hand, routing structures do have these requirements (rnh_gen, multipath, custom to-be-added control plane functions, different locking). Additionally, radix should not know anything about its consumers' internals. So, radix code now uses tiny 'struct radix_head' structure along with internal 'struct radix_mask_head' instead of 'struct radix_node_head'. Existing consumers still use the same 'struct radix_node_head' with slight modifications: they need to pass pointer to (embedded) 'struct radix_head' to all radix callbacks. Routing code now uses new 'struct rib_head' with different locking macro: RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing information base). New net/route_var.h header was added to hold routing subsystem internal data. 'struct rib_head' was placed there. 'struct rtentry' will also be moved there soon.
# fcbfdb37	14-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix panic in IP redirect. Panic was introduced in r293466. Found by: Yamagi Burmeister <lists at yamagi.org>
# 10e0e235	14-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove now-unused wrappers for various routing functions.
# 0eb64f4e	13-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove RTF_RNH_LOCKED support from rtalloc1_fib(). Last caller using it was eliminated in r293471. Sponsored by: Yandex LLC
# f2b2e77a	08-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	(Temporarily) remove route_redirect_event eventhandler. Such handler should pass different set of variables, instead of directly providing 2 locked route entries. Given that it hasn't been really used since at least 2012, remove current code. Will re-add it after finishing most major routing-related changes. Discussed with: np
# 16703ea8	08-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Please Coverity by removing unnecessary check (rt_key() is always set). Coverity CID: 1347797
# 048738b5	08-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Do more fine-grained locking in rtrequest1_fib(). Last consumer using RTF_RNH_LOCKED flag was eliminated in r291643. Restrict passing RTF_RNH_LOCKED to rtrequest1_fib() and do better locking for RTM_ADD / RTM_DELETE cases.
# 9a1b64d5	04-Jan-2016	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add rib_lookup_info() to provide API for retrieving individual route entries data in unified format. There are control plane functions that require information other than just next-hop data (e.g. individual rtentry fields like flags or prefix/mask). Given that the goal is to avoid rte reference/refcounting, re-use rt_addrinfo structure to store most rte fields. If caller wants to retrieve key/mask or gateway (which are sockaddrs and are allocated separately), it needs to provide sufficient-sized sockaddrs structures w/ their pointers saved in passed rt_addrinfo. Convert: * lltable new records checks (in_lltable_rtcheck(), nd6_is_new_addr_neighbor()). * rtsock pre-add/change route check. * IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because 1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should not be multiple host routes for such hosts 2) if we have multiple routes we should inspect them (which is not done). 3) the entire idea of abusing KRT as storage for ND proxy seems odd. Userland programs should be used for that purpose).
# 6af272d8	13-Dec-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix PINNED routes handling. Before r291643, adding new interface prefix had the following logic: try_add: EEXIST && (PINNED) { try_del(w/o PINNED flag) if (OK) try_add(PINNED) } In r291643, deletion was performed w/ PINNED flag held which led to new interface prefixes (like ::1) overriding older ones. Fix this by requesting deletion w/o RTF_PINNED. PR: kern/205285 Submitted by: Fabian Keil <fk at fabiankeil.de>
# 4b3dc898	02-Dec-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move RTF_PINNED handling to generic route code. This eliminates last RTF_RNH_LOCKED rtrequest1_fib() user.
# af5c99e5	30-Nov-2015	Enji Cooper <ngie@FreeBSD.org>	Fix LINT-NOIP kernels after r291467 rn is only used if INET or INET6 are defined Sponsored by: EMC / Isilon Storage Division
# 674e0823	29-Nov-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move flowtable rte checks to separate function.
# e8b0643e	29-Nov-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Add new rt_foreach_fib_walk_del() function for deleting route entries by filter function instead of picking into routing table details in each consumer. Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED user). This simplifies future nexthops/multipath changes and rtrequest1_fib() locking refactoring. Actual changes: Add "rt_chain" field to permit rte grouping while doing batched delete from routing table (thus growing rte 200->208 on amd64). Add "rti_filter" / "rti_filterdata" / "rti_spare" fields to rt_addrinfo to pass filter function to various routing subsystems in standard way. Convert all rt_expunge() customers to new rt_addrinfo-based api and eliminate rt_expunge().
# e4790abf	14-Nov-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Pass provided af instead of AF_UNSPEC to setwa_f callback.
# 2780ba06	29-Oct-2015	Bryan Drewery <bdrewery@FreeBSD.org>	Avoid passing an uninitialized 'i'. Currently nothing was depending on it anyhow. Coverity CID: 1331562
# f221bcaa	17-Oct-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove several compat functions from pre-fib era.
# 17a03656	14-Sep-2015	Eric van Gyzen <vangyzen@FreeBSD.org>	Fix the handling of IPv6 On-Link Redirects. On receipt of a redirect message, install an interface route for the redirected destination. On removal of the corresponding Neighbor Cache entry, remove the interface route. This requires changes in rtredirect_fib() to cope with an AF_LINK address for the gateway and with the absence of RTF_GATEWAY. This fixes the "Redirected On-Link" test cases in the Tahi IPv6 Ready Logo Phase 2 test suite. Unrelated to the above, fix a recursion on the radix node head lock triggered by the Tahi Redirected to Alternate Router test cases. When I first wrote this patch in October 2012, all Section 2 (Neighbor Discovery) test cases passed on 10-CURRENT, 9-STABLE, and 8-STABLE. cem@ recently rebased the 10.x patch onto head and reported that it passes Tahi. (Thanks!) These other test cases also passed in 2012: * the RTF_MODIFIED case, with IPv4 and IPv6 (using a RTF_HOST|RTF_GATEWAY route for the destination) * the redirected-to-self case, with IPv4 and IPv6 * a valid IPv4 redirect All testing in 2012 was done with WITNESS and INVARIANTS. Tested by: EMC / Isilon Storage Division via Conrad Meyer (cem) in 2015, Mark Kelley <mark_kelley@dell.com> in 2012, TC Telkamp <terence_telkamp@dell.com> in 2012 PR: 152791 Reviewed by: melifaro (current rev), bz (earlier rev) Approved by: kib (mentor) MFC after: 1 month Relnotes: yes Sponsored by: Dell Inc. Differential Revision: https://reviews.freebsd.org/D3602
# 441f9243	04-Sep-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Constantify lookup key in ifa_ifwith* functions. Some places in our network stack already have const arguments (like if_output() routines and LLE functions). Code using ifa_ifwith (and similar functions) along with LLE/_output functions is currently bound to use tricks like __DECONST(). Provide a cleaner way by making sockaddr lookup key really constant. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3464
# 2caee4be	10-Aug-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	Rename rt_foreach_fib() to rt_foreach_fib_walk(). Suggested by: julian
# 4bdf0b6a	08-Aug-2015	Alexander V. Chernikov <melifaro@FreeBSD.org>	MFP r274295: * Move interface route cleanup to route.c:rt_flushifroutes() * Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users to use new rt_foreach_fib() instead of hand-rolling cycles.
# 8b15f615	29-Jul-2015	Luiz Otavio O Souza <loos@FreeBSD.org>	Follow r256586 and rename the kernel version of the Free() macro to R_Free(). This matches the other macros and reduces the chances to clash with other headers. This also fixes the build of radix.c outside of the kernel environment. Reviewed by: glebius
# 546afaf8	15-Apr-2015	Marcelo Araujo <araujo@FreeBSD.org>	Remove duplicate header entry.
# 5d14e4cd	29-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Provide rte_<get|set> methods to access rtentry for external consumers.
# 1be1588a	29-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	* Make ifa_add_loopback_route() prepare gw before insertion. * Temporarily move ifa_switch_loopback_route() implementation to route.c
# acbc394d	23-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Finish r274335#2: put RT_LOCK_DESTROY() back.
# 7f948f12	16-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Finish r274175: do control plane MTU tracking. Update route MTU in case of ifnet MTU change. Add new RTF_FIXEDMTU to track explicitly specified MTU. Old behavior: ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU. User has to manually update all routes. ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU. However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu gets updated. New behavior: ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated with new MTU unless RTF_FIXEDMTU flag set on them. ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu. route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag. route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag. PR: 194238 MFC after: 1 month CR: D1125
# 98af5b3a	16-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Finish r274335: * put RT_LOCK_DESTROY() back * remove unused RT_UNLOCK_COND macro
# ac2cf5d3	16-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Revert r274585: rte lock is properly destroyed in uma dtor callback. Pointed by: glebius
# 3cb04899	16-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Make witness happy: destroy rte lock before free. MFC after: 2 weeks
# f7bab8d0	09-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Switch route radix to dual-lock model: use rmlock for data path access, and config rwlock for control plane processing. Route table changes require both locks held.
# 69d149ad	09-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Since we no longer return individual radix entries, it is not possible to do per-rte accounting. Remove rt_kpktsent.
# 033074c4	09-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Replace 'struct route ' if_output() argument with 'struct nhop_info '. Leave 'struct route' as is for legacy routing api users. Remove most of rtalloc_ign*-derived functions.
# 55e5eda6	08-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Separate radix and routing: use different structures for route and for other customers. Introduce new 'struct rib_head' for routing purposes and make all routing api use it.
# 1398ffe5	08-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users to use new rt_foreach_fib() instead of hand-rolling cycles.
# 57c3556b	06-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix build. Pointy hat to: melifaro
# 146a181f	06-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Finish r274118: remove useless fields from struct domain. Sponsored by: Yandex LLC
# 1a75e3b2	06-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Make checks for rt_mtu generic: Some virtual if drivers have (ab)used ifa ifa_rtrequest hook to enforce route MTU to be not bigger than interface MTU. While ifa_rtrequest hooking might be an option in some situation, it is not feasible to do MTU checks there: generic (or per-domain) routing code is perfectly capable of doing this. We currently have 3 places where MTU is altered: 1) route addition. In this case domain overrides radix _addroute callback (in[6]_addroute) and all necessary checks/fixes are/can be done there. 2) route change (especially, GW change). In this case, there are no explicit per-domain calls, but one can override rte by setting ifa_rtrequest hook to domain handler (inet6 does this). 3) ifconfig ifaceX mtu YYYY In this case, we have no callbacks, but ip[6]_output performs runtime checks and decreases rt_mtu if necessary. Generally, the goals are to be able to handle all MTU changes in control plane, not in runtime part, and properly deal with increased interface MTU. This commit changes the following: * removes hooks setting MTU from drivers side * adds proper per-domain MTU checks for case 1) * adds generic MTU check for case 2) * The latter is done by using new dom_ifmtu callback since if_mtu denotes L3 interface MTU, e.g. maximum transmitted _packet_ size. However, IPv6 mtu might be different from if_mtu one (e.g. default 1280) for some cases, so we need an abstract way to know maximum MTU size for given interface and domain. * moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies user-supplied data which must be checked. * removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to use this functions on new non-inserted rte. More changes will follow soon. MFC after: 1 month Sponsored by: Yandex LLC
# 8c3cfe0b	04-Nov-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Hide 'struct rtentry' and all its macro inside new header: net/route_internal.h The goal is to make it opaque for all code except route/rtsock and proto domain _rmx.
# ee0bd4b9	20-Sep-2014	Hiroki Sato <hrs@FreeBSD.org>	Make net.add_addr_allfibs vnet-local.
# 4f8585e0	11-Sep-2014	Alan Somers <asomers@FreeBSD.org>	Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and ifa_ifwithdstaddr. For the sake of backwards compatibility, the new arguments were added to new functions named ifa_ifwithnet_fib and ifa_ifwithdstaddr_fib, while the old functions became wrappers around the new ones that passed RT_ALL_FIBS for the fib argument. However, the backwards compatibility is not desired for FreeBSD 11, because there are numerous other incompatible changes to the ifnet(9) API. We therefore decided to remove it from head but leave it in place for stable/9 and stable/10. In addition, this commit adds the fib argument to ifa_ifwithbroadaddr for consistency's sake. sys/sys/param.h Increment __FreeBSD_version sys/net/if.c sys/net/if_var.h sys/net/route.c Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute. sys/net/route.c sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_options.c sys/netinet/ip_output.c sys/netinet6/nd6.c Fixup calls of modified functions. share/man/man9/ifnet.9 Document changed API. CR: https://reviews.freebsd.org/D458 MFC after: Never Sponsored by: Spectra Logic
# af3b2549	27-Jun-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Pull in r267961 and r267973 again. Fix for issues reported will follow.
# 37a107a4	27-Jun-2014	Glen Barber <gjb@FreeBSD.org>	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory
# 3da1cf1e	27-Jun-2014	Hans Petter Selasky <hselasky@FreeBSD.org>	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating children of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading kludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies
# 2f308a34	29-May-2014	Alan Somers <asomers@FreeBSD.org>	Fix unintended KBI change from r264905. Add _fib versions of ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the _fib() versions with RT_ALL_FIBS, preserving legacy behavior. sys/net/if_var.h sys/net/if.c Add legacy-compatible functions as described above. Ensure legacy behavior when RT_ALL_FIBS is passed as fibnum. sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/net/route.c sys/net/rtsock.c sys/netinet6/nd6.c Call with _fib() functions if we must use a specific fib, or the legacy functions otherwise. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c Improve the udp_dontroute test. The bug that this test exercises is that ifa_ifwithnet() will return the wrong address, if multiple interfaces have addresses on the same subnet but with different fibs. The previous version of the test only considered one possible failure mode: that ifa_ifwithnet_fib() might fail to find any suitable address at all. The new version also checks whether ifa_ifwithnet_fib() finds the correct address by checking where the ARP request goes. Reported by: bz, hrs Reviewed by: hrs MFC after: 1 week X-MFC-with: 264905 Sponsored by: Spectra Logic
# 972ed56a	03-May-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove additional fib checks from rtalloc1_fib. It looks like current consumers are either unaware of MRT (and use RT_DEFAULT_FIB implicitly) or know what they are doing. In the latter case they will either be hit by KASSERT or ESRCH will be returned due to NULL rnh.
# b980262e	03-May-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Pass radix head ptr along with rte to rtexpunge(). Rename rtexpunge to rt_expunge().
# 0fb9298d	29-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Move rt_setmetrics() from rtsock.c to route.c. All rtsock-initiated rte creation/modification are now performed in route.c holding radix tree write lock. This reduces the need for per-rte mutex. Sponsored by: Yandex LLC MFC after: 1 month
# a713ee5c	28-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Do not use senderr() in rtrequest1_fib_change(). Suggested by: glebius MFC after: 4 weeks
# f59c6cb0	26-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Remove useless `register' declarations. MFC after: 1 month
# c77462dd	26-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Decouple RTM_CHANGE from RTM_GET handling in rtsock.c:route_output(). RTM_CHANGE is now handled inside route.c:rtrequest1_fib() as it should be. Note the change handler is a separate function rtrequest1_fib_change(). MFC after: 1 month
# 36d55f0f	26-Apr-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Unify sa_equal() macro usage. MFC after: 2 weeks
# 0cfee0c2	24-Apr-2014	Alan Somers <asomers@FreeBSD.org>	Fix subnet and default routes on different FIBs on the same subnet. These two bugs are closely related. The root cause is that ifa_ifwithnet does not consider FIBs when searching for an interface address. sys/net/if_var.h sys/net/if.c Add a fib argument to ifa_ifwithnet and ifa_ifwithdstaddr. Those functions will only return an address whose interface fib equals the argument. sys/net/route.c Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib arguments. sys/netinet/in.c Update in_addprefix to consider the interface fib when adding prefixes. This will prevent it from not adding a subnet route when one already exists on a different fib. sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/netinet6/nd6.c Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet. In some cases there wasn't a clear specific fib number to use. In others, I was unable to test those functions so I chose RT_DEFAULT_FIB to minimize divergence from current behavior. I will fix some of the latter changes along with PR kern/187553. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c tests/sys/netinet/Makefile Revert r263738. The udp_dontroute test was right all along. However, bugs kern/187550 and kern/187553 cancelled each other out when it came to this test. Because of kern/187553, ifa_ifwithnet searched the default fib instead of the requested one, but because of kern/187550, there was an applicable subnet route on the default fib. The new test added in r263738 doesn't work right, however. I can verify with dtrace that ifa_ifwithnet returned the wrong address before I applied this commit, but route(8) miraculously found the correct interface to use anyway. I don't know how. Clear expected failure messages for kern/187550 and kern/187552. PR: kern/187550 PR: kern/187552 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic
# 0489b891	24-Apr-2014	Alan Somers <asomers@FreeBSD.org>	Fix host and network routes for new interfaces when net.add_addr_allfibs=0 sys/net/route.c In rtinit1, use the interface fib instead of the process fib. The latter wasn't very useful because ifconfig(8) is usually invoked with the default process fib. Changing ifconfig(8) to use setfib(2) would be redundant, because it already sets the interface fib. tests/sys/netinet/fibs_test.sh Clear the expected ATF failure sys/net/if.c Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib sys/netinet/in.c sys/net/if_var.h Add a fibnum argument to ifa_switch_loopback_route, a subroutine of in_scrubprefix. Pass it the interface fib. PR: kern/187549 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic Corporation
# 7f946da0	07-Apr-2014	Michael Tuexen <tuexen@FreeBSD.org>	Call sctp_addr_change() from rt_addrmsg() instead of rt_newaddrmsg_fib(), since rt_addrmsg() gets also called from other functions. MFC after: 3 days
# 66dcee72	15-Mar-2014	Gleb Smirnoff <glebius@FreeBSD.org>	Garbage collect long time obsoleted (or never used) stuff from routing API.
# 256ea2ab	05-Mar-2014	Gleb Smirnoff <glebius@FreeBSD.org>	The route code used to mtx_destroy() a locked mutex before rtentry free. Now, after r262763 it started to return locked mutexes to UMA. To fix that, conditionally unlock the mutex in the destructor. Tested by: "Sergey V. Dyatko" <sergey.dyatko@gmail.com>
# e3a7aa6f	04-Mar-2014	Gleb Smirnoff <glebius@FreeBSD.org>	- Remove rt_metrics_lite and simply put its members into rtentry. - Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache thrashing ++ from packet forwarding path. - Create zini/fini methods for the rtentry UMA zone. Via initialize mutex and counter in them. - Fix reporting of rmx_pksent to routing socket. - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode. The change is mostly targeted for stable/10 merge. For head, rt_pksent is expected to just disappear. Discussed with: melifaro Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 9b5f5ede	04-Mar-2014	George V. Neville-Neil <gnn@FreeBSD.org>	Revert previous commit (262727) and bounce patch back to the submitter. Pointed out by: jhb
# 596031c0	03-Mar-2014	George V. Neville-Neil <gnn@FreeBSD.org>	Naming consistency fix. The routing code defines RADIX_NODE_HEAD_LOCK as grabbing the write lock, but RADIX_NODE_HEAD_LOCK_ASSERT as checking the read lock. Submitted by: Vijay Singh <vijju.singh at gmail.com> MFC after: 1 month
# 5d6d7e75	07-Feb-2014	Gleb Smirnoff <glebius@FreeBSD.org>	o Revamp API between flowtable and netinet, netinet6. - ip_output() and ip_output6() simply call flowtable_lookup(), passing mbuf and address family. That's the only code under #ifdef FLOWTABLE in the protocols code now. o Revamp statistics gathering and export. - Remove hand made pcpu stats, and utilize counter(9). - Snapshot of statistics is available via 'netstat -rs'. - All sysctls are moved into net.flowtable namespace, since spreading them over net.inet isn't correct. o Properly separate at compile time INET and INET6 parts. o General cleanup. - Remove chain of multiple flowtables. We simply have one for IPv4 and one for IPv6. - Flowtables are allocated in flowtable.c, symbols are static. - With proper argument to SYSINIT() we no longer need flowtable_ready. - Hash salt doesn't need to be per-VNET. - Removed rudimentary debugging, which is quite useless in dtrace era. The runtime behavior of flowtable shouldn't be changed by this commit. Sponsored by: Netflix Sponsored by: Nginx, Inc.
# d375edc9	09-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Simplify inet alias handling code: if we're adding/removing alias which has the same prefix as some other alias on the same interface, use newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg(). This eliminates the following rtsock messages: Pinned RTM_ADD for prefix (for alias addition). Pinned RTM_DELETE for prefix (for alias withdrawal). Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24): before commit, addition: got message of size 116 on Fri Jan 10 14:13:15 2014 RTM_NEWADDR: address being added to iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 got message of size 192 on Fri Jan 10 14:13:15 2014 RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 10.0.0.2 (255) ffff ffff ff after commit, addition: got message of size 116 on Fri Jan 10 13:56:26 2014 RTM_NEWADDR: address being added to iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255 before commit, withdrawal: got message of size 192 on Fri Jan 10 13:58:59 2014 RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 10.0.0.2 (255) ffff ffff ff got message of size 116 on Fri Jan 10 13:58:59 2014 RTM_DELADDR: address being removed from iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 after commit, withdrawal: got message of size 116 on Fri Jan 10 14:14:11 2014 RTM_DELADDR: address being removed from iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong (and requires some hacks to keep prefix in route table on RTM_DELETE). 
I've tested this change with quagga (no change) and bird (). bird alias handling is already broken in BSD sysdep code, so nothing changes here, too. I'm going to MFC this change if there are no complaints about behavior change. While here, fix some style(9) bugs introduced by r260488 (pointed by glebius and bde). Sponsored by: Yandex LLC MFC after: 4 weeks
# 4cbac30b	09-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Split rt_newaddrmsg_fib() into two different functions. Adding/deleting interface addresses involves access to 3 different subsystems, in different parts of code. Each call can fail, so reporting successful operation by rtsock in the middle of the process is error-prone. Further split routing notification API and actual rtsock calls via creating public-available rt_addrmsg() / rt_routemsg() functions with "private" rtsock_* backend. MFC after: 2 weeks
# 7d9b6df1	08-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Consistently use RT_ALL_FIBS everywhere instead of -1. MFC after: 2 weeks
# 034c09ff	06-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Partially fix IPv4 interface routes deletion in RADIX_MPATH. Noticed by: Nikolay Denev <ndenev at gmail.com> MFC after: 1 month
# 5a2f4cbd	04-Jan-2014	Alexander V. Chernikov <melifaro@FreeBSD.org>	Change semantics for rnh_lookup() function: now it performs exact match search, regardless of netmask existence. This simplifies most of rnh_lookup() consumers. Fix panic triggered by deleting non-existent host route. PR: kern/185092 Submitted by: Nikolay Denev <ndenev at gmail.com> MFC after: 1 month
# 6274ce3e	25-Nov-2013	Craig Rodrigues <rodrigc@FreeBSD.org>	In vnet_route_uninit(), free some memory that is allocated in vnet_route_init(). To reproduce the problem: (1) Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS, INVARIANTS. (2) Run this command in a loop: jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo see: http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021291.html This doesn't eliminate all the "Freed UMA keg was not empty" warning messages on the console, but it helps.
# 76039bc8	26-Oct-2013	Gleb Smirnoff <glebius@FreeBSD.org>	The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare to this event, adding if_var.h to files that do need it. Also, include all includes that now are included due to implicit pollution via if_var.h Sponsored by: Netflix Sponsored by: Nginx, Inc.
# 65a17d74	15-Oct-2013	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix long-standing issue with incorrect radix mask calculation. Usual symptoms are messages like rn_delete: inconsistent annotation rn_addmask: mask impossibly already in tree or inability to flush/delete particular prefix in ipfw table. Changes: * Assume 32 bytes as maximum radix key length * Remove rn_init() * Statically allocate rn_ones/rn_zeroes * Make separate mask tree for each "normal" tree instead of system global one * Remove "optimization" on masks reuse and key zeroing * Change rn_addmask() arguments to accept tree pointer (no users in base) PR: kern/182851, kern/169206, kern/135476, kern/134531 Found by: Slawa Olhovchenkov <slw@zxy.spb.ru> MFC after: 2 weeks Reviewed by: glebius Sponsored by: Yandex LLC
# d54455b0	18-May-2013	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix rte leak introduced in r248070. MFC after: 2 weeks
# 4871fc4a	16-May-2013	Julian Elischer <julian@FreeBSD.org>	Finally change the mbuf to have its own fib field instead of stealing 4 flag bits. This was supposed to happen in 8.0, and again in 2012.. MFC after: never
# 3034f43f	08-Mar-2013	Alexander V. Chernikov <melifaro@FreeBSD.org>	Fix long-standing issue with interface routes being unprotected: Use RTM_PINNED flag to mark route as immutable. Forbid deleting immutable routes without special rtrequest1_fib() flag. Adding interface address with prefix already in route table is handled by atomically deleting old prefix and adding interface one. Discussed with: andre, eri MFC after: 3 weeks
# 14126522	05-Mar-2013	Alexander V. Chernikov <melifaro@FreeBSD.org>	Write lock is not required for find&compare operation. MFC after: 2 weeks
# bfca216e	18-Mar-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Hide kernel option ROUTETABLES evaluations in the implementation rather than the header file. With this also move RT_MAXFIBS and RT_NUMFIBS into the implementation to avoid further usage in other code. rt_numfibs is all that should be needed. This allows users to change the number of FIBs from 1..RT_MAXFIBS(16) dynamically using the tunable without the need to change the kernel config for the maximum anymore. This means that the multi-FIB feature is now fully available with GENERIC kernels. The kernel option ROUTETABLES can still be used to set the default numbers of FIBs in absence of the tunable. Ok.ed by: julian, hrs, melifaro MFC after: 2 weeks
# a8498625	02-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Move a comment from rtinit1() to the top of the file where dealing with the (maximum) number of FIBs trying to clarify that eventually FIBs should probably be attached to domain(9) specific storage. [1] Add a comment on a limitation on the rt_add_addr_allfibs option. Use RT_DEFAULT_FIB instead of 0 where applicable. Add empty line to functions without local variables per style. Put public yet unused in-tree function rtinit_fib() under BURN_BRIDGES to indicate that it might go away in the future. No functional change. Discussed with: julian [1] (clarification on what the original one meant) Sponsored by: Cisco Systems, Inc.
# b3dd0771	03-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Minor optimization doing input validation with a possible early return before doing further work. Sponsored by: Cisco Systems, Inc.
# 096f2786	03-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix FLOWTABLE IPv6 handling in route.c missed in r205066. While doing so, for consistency with the rtalloc_ign_fib(9) interface called, remove the "in_" prefix from rtalloc_ign_wrapper() no longer indicating that it would only handle the INET case. Sponsored by: Cisco Systems, Inc.
# b680a383	03-Feb-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Allow for IPv6 to allocate (and in the VIMAGE case free) as many routing tables (FIBs) as IPv4. Prepare various general rt* functions for multi-FIB IPv6 handling in addition to already existing multi-FIB IPv4 cases. Sponsored by: Cisco Systems, Inc.
# 8d74af36	24-Jan-2012	Bjoern A. Zeeb <bz@FreeBSD.org>	Replace random ARIN direct assignment legacy IPs with proper RFC 5735 TEST-NET1 block for use in documentation and example code addresses. MFC after: 3 days
# f3909e37	14-Dec-2011	Gleb Smirnoff <glebius@FreeBSD.org>	Simplify rtrequest(RTM_ADD): ifa can't be NULL after rt_getifa_fib().
# 46a70de2	24-Oct-2011	Qing Li <qingli@FreeBSD.org>	The host-id/interface-id can have a specific value and is properly masked out when adding a prefix route through the "route" command. However, when deleting the route, simply changing the command keyword from "add" to "delete" does not work. The failure is observed in both IPv4 and IPv6 route insertion. The patch makes the route command behavior consistent between the "add" and the "delete" operation. MFC after: 1 week
# 528737fd	28-Sep-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Pass the fibnum where we need filtering of the message on the rtsock allowing routing daemons to filter routing updates on an rtsock per FIB. Adjust raw_input() and split it into wrapper and a new function taking an optional callback argument even though we only have one consumer [1] to keep the hackish flags local to rtsock.c. PR: kern/134931 Submitted by: multiple (see PR) Suggested by: rwatson [1] Reviewed by: rwatson MFC after: 3 days
# 8451d0dd	16-Sep-2011	Kip Macy <kmacy@FreeBSD.org>	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
# e9ff3d45	07-Aug-2011	Kevin Lo <kevlo@FreeBSD.org>	In rtinit1(), before rtrequest1_fib() is called, info.rti_flags is initialized by flags (function argument) or-ed with ifa->ifa_flags. If both NIC has a loopback route to itself, so IFA_RTSELF is set on ifa(s). As IFA_RTSELF is defined by RTF_HOST, rtrequest1_fib() is called with RTF_HOST flag even if netmask is not NULL. Consequently, netmask is set to zero in rtrequest1_fib(), and request to add network route is changed under hands to request to add host route. Tested by: Andrew Boyer <aboyer at averesystems.com> Submitted by: Svatopluk Kraus <onwahe at gmail dot com> Approved by: re (hrs)
# f5857e2d	21-Jun-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Garbage collect never used global, sysctl, externs. MFC after: 1 week
# b8b8e0c9	19-Jun-2011	Bjoern A. Zeeb <bz@FreeBSD.org>	Leave an extra comment about flowtable and IPv6 support rectifying a previous comment. MFC after: 1 week
# e579f1c1	19-Mar-2011	Dmitry Chagin <dchagin@FreeBSD.org>	ouch, newrt is used on the return path, my fault. Partially revert the previous change. MFC after: 1 Week.
# 523e6002	19-Mar-2011	Dmitry Chagin <dchagin@FreeBSD.org>	A bit rearranged rtalloc1_fib() code. Initialize a variable when it is really needed. To avoid code duplication move the miss label to line up and jump on it. MFC after: 1 Week
# 6a873ef7	19-Mar-2011	Dmitry Chagin <dchagin@FreeBSD.org>	Remove a now unused variable. MFC after: 1 Week
# 6bccea7c	21-Feb-2011	Rebecca Cran <brucec@FreeBSD.org>	Fix typos - remove duplicate "the". PR: bin/154928 Submitted by: Eitan Adler <lists at eitanadler.com> MFC after: 3 days
# f88910cd	12-Jan-2011	Matthew D Fleming <mdf@FreeBSD.org>	sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly. Commit the net* piece.
# 3e288e62	22-Nov-2010	Dimitry Andric <dim@FreeBSD.org>	After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
# 31c6a003	14-Nov-2010	Dimitry Andric <dim@FreeBSD.org>	Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# dd62f5c0	25-Jun-2010	Qing Li <qingli@FreeBSD.org>	MFC r208553 This patch fixes the problem where proxy ARP entries cannot be added over the if_ng interface. Approved by: re (bz)
# 0ed6142b	25-May-2010	Qing Li <qingli@FreeBSD.org>	This patch fixes the problem where proxy ARP entries cannot be added over the if_ng interface. MFC after: 3 days
# 480d7c6c	06-May-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r207369: MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitespace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOBALS (before r185088) and try to reduce the diff to stable/7 and earlier as much as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH
# 82cea7e6	29-Apr-2010	Bjoern A. Zeeb <bz@FreeBSD.org>	MFP4: @176978-176982, 176984, 176990-176994, 177441 "Whitespace" churn after the VIMAGE/VNET whirls. Remove the need for some "init" functions within the network stack, like pim6_init(), icmp_init() or significantly shorten others like ip6_init() and nd6_init(), using static initialization again where possible and formerly missed. Move (most) variables back to the place they used to be before the container structs and VIMAGE_GLOBALS (before r185088) and try to reduce the diff to stable/7 and earlier as much as possible, to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9. This also removes some header file pollution for putatively static global variables. Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are no longer needed. Reviewed by: jhb Discussed with: rwatson Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH MFC after: 6 days
# c951da56	01-Apr-2010	Qing Li <qingli@FreeBSD.org>	MFC 204902 One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to allow for connection load balancing across interfaces. Currently the address alias handling method is colliding with the ECMP code. For example, when two interfaces are configured on the same prefix, only one prefix route is installed. So connection load balancing among the available interfaces is not possible. The other advantage of ECMP is for failover. The issue with the current code, is that the interface link-state is not reflected in the route entry. For example, if there are two interfaces on the same prefix, the cable on one interface is unplugged, new and existing connections should switch over to the other interface. This is not done today and packets go into a black hole. Also, there is a small bug in the kernel where deleting ECMP routes in the userland will always return an error even though the command is successfully executed.
# 8018e843	23-Mar-2010	Luigi Rizzo <luigi@FreeBSD.org>	MFC of a large number of ipfw and dummynet fixes and enhancements done in CURRENT over the last 4 months. HEAD and RELENG_8 are almost in sync now for ipfw, dummynet the pfil hooks and related components. Among the most noticeable changes: - r200855 more efficient lookup of skipto rules, and remove O(N) blocks from critical sections in the kernel; - r204591 large restructuring of the dummynet module, with support for multiple scheduling algorithms (4 available so far) See the original commit logs for details. Changes in the kernel/userland ABI should be harmless because the kernel is able to understand previous requests from RELENG_8 and RELENG_7. For this reason, this changeset would be applicable to RELENG_7 as well, but i am not sure if it is worthwhile.
# c7ea0aa6	08-Mar-2010	Qing Li <qingli@FreeBSD.org>	One of the advantages of enabling ECMP (a.k.a RADIX_MPATH) is to allow for connection load balancing across interfaces. Currently the address alias handling method is colliding with the ECMP code. For example, when two interfaces are configured on the same prefix, only one prefix route is installed. So connection load balancing among the available interfaces is not possible. The other advantage of ECMP is for failover. The issue with the current code, is that the interface link-state is not reflected in the route entry. For example, if there are two interfaces on the same prefix, the cable on one interface is unplugged, new and existing connections should switch over to the other interface. This is not done today and packets go into a black hole. Also, there is a small bug in the kernel where deleting ECMP routes in the userland will always return an error even though the command is successfully executed. MFC after: 5 days
# 32c53401	05-Jan-2010	Qing Li <qingli@FreeBSD.org>	MFC r201282, r201543 r201282 ------- The proxy arp entries could not be added into the system over the IFF_POINTOPOINT link types. The reason was due to the routing entry returned from the kernel covering the remote end is of an interface type that does not support ARP. This patch fixes this problem by providing a hint to the kernel routing code, which indicates the prefix route instead of the PPP host route should be returned to the caller. Since a host route to the local end point is also added into the routing table, and there could be multiple such instantiations due to multiple PPP links can be created with the same local end IP address, this patch also fixes the loopback route installation failure problem observed prior to this patch. The reference count of loopback route to local end would be either incremented or decremented. The first instantiation would create the entry and the last removal would delete the route entry. r201543 ------- The IFA_RTSELF address flag marks a loopback route has been installed for the interface address. This marker is necessary to properly support PPP types of links where multiple links can have the same local end IP address. The IFA_RTSELF flag bit maps to the RTF_HOST value, which was combined into the route flag bits during prefix installation in IPv6. This inclusion causing the prefix route to be unusable. This patch fixes this bug by excluding the IFA_RTSELF flag during route installation. PR: ports/141342, kern/141134
# c7ab6602	30-Dec-2009	Qing Li <qingli@FreeBSD.org>	The proxy arp entries could not be added into the system over the IFF_POINTOPOINT link types. The reason was due to the routing entry returned from the kernel covering the remote end is of an interface type that does not support ARP. This patch fixes this problem by providing a hint to the kernel routing code, which indicates the prefix route instead of the PPP host route should be returned to the caller. Since a host route to the local end point is also added into the routing table, and there could be multiple such instantiations due to multiple PPP links can be created with the same local end IP address, this patch also fixes the loopback route installation failure problem observed prior to this patch. The reference count of loopback route to local end would be either incremented or decremented. The first instantiation would create the entry and the last removal would delete the route entry. MFC after: 5 days
# 614cb839	14-Dec-2009	Luigi Rizzo <luigi@FreeBSD.org>	Move the scan for max_keylen into route.c::route_init(), and make max_keylen an argument for rn_init(). This removes an unnecessary dependency on domain.h from radix.c MFC after: 7 days
# cf19fced	07-Dec-2009	Michael Tuexen <tuexen@FreeBSD.org>	MFC 197288,197326,197327,197328,197342,197914,197929, 197955,199365,199370,199371,199373,199866 This MFCs all SCTP/VNET relevant fixes from head. Approved by: rrs (mentor)
# 7f279720	16-Nov-2009	Michael Tuexen <tuexen@FreeBSD.org>	Fix a LOR showing up with sctp_bsd_addr(): Do not hold a rt lock when calling rt_newaddrmsg(). Reviewed by: qingli Approved by: rrs (mentor) MFC after: 1 month
# 67f0b21f	08-Oct-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	MFC r197727: Put #ifdef INET around parts of the FLOWTABLE code, to unbreak nooptions INET kernel builds. Approved by: re (kib)
# e85f0cc5	06-Oct-2009	Qing Li <qingli@FreeBSD.org>	MFC r197687 The flow-table associates TCP/UDP flows and IP destinations with specific routes. When the routing table changes, for example, when a new route with a more specific prefix is inserted into the routing table, the flow-table is not updated to reflect that change. As such existing connections cannot take advantage of the new path. In some cases the path is broken. This patch will update the affected flow-table entries when a more specific route is added. The route entry is properly marked when a route is deleted from the table. In this case, when the flow-table performs a search, the stale entry is updated automatically. Therefore this patch is not necessary for route deletion. Reviewed by: bz, kmacy Approved by: re
# db44ff40	03-Oct-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Put #ifdef INET around parts of the FLOWTABLE code, to unbreak nooptions INET kernel builds. MFC after: 3 days X-MFC: with r197687
# e5c610d6	01-Oct-2009	Qing Li <qingli@FreeBSD.org>	The flow-table associates TCP/UDP flows and IP destinations with specific routes. When the routing table changes, for example, when a new route with a more specific prefix is inserted into the routing table, the flow-table is not updated to reflect that change. As such existing connections cannot take advantage of the new path. In some cases the path is broken. This patch will update the affected flow-table entries when a more specific route is added. The route entry is properly marked when a route is deleted from the table. In this case, when the flow-table performs a search, the stale entry is updated automatically. Therefore this patch is not necessary for route deletion. Submitted by: simon, phk Reviewed by: bz, kmacy MFC after: 3 days
# 530c0060	01-Aug-2009	Robert Watson <rwatson@FreeBSD.org>	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# d0728d71	23-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Introduce and use a sysinit-based initialization scheme for virtual network stacks, VNET_SYSINIT: - Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will occur each time a network stack is instantiated and destroyed. In the !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT. For the VIMAGE case, we instead use SYSINIT's to track their order and properties on registration, using them for each vnet when created/ destroyed, or immediately on module load for already-started vnets. - Remove vnet_modinfo mechanism that existed to serve this purpose previously, as well as its dependency scheme: we now just use the SYSINIT ordering scheme. - Implement VNET_DOMAIN_SET() to allow protocol domains to declare that they want init functions to be called for each virtual network stack rather than just once at boot, compiling down to DOMAIN_SET() in the non-VIMAGE case. - Walk all virtualized kernel subsystems and make use of these instead of modinfo or DOMAIN_SET() for init/uninit events. In some cases, convert modular components from using modevent to using sysinit (where appropriate). In some cases, do minor rejuggling of SYSINIT ordering to make room for or better manage events. Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup) Discussed with: jhb, bz, julian, zec Reviewed by: bz Approved by: re (VIMAGE blanket)
# 1e77c105	16-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
# eddfbb76	14-Jul-2009	Robert Watson <rwatson@FreeBSD.org>	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing them in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
# 6a7bff2c	11-Jul-2009	Kip Macy <kmacy@FreeBSD.org>	Re-factoring for adding weighted routes introduced a fairly irritating bug where the system will panic when RADIX_MPATH is enabled. This change fixes this. Approved by: re@
# 8c0fec80	23-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)
# b58ea5f3	22-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Move virtualization of routing related variables into their own Vimage module, which had been there already but now is stateful. All variables are now file local; so this further limits the global spreading of routing related things throughout the kernel. Add a missing function local variable in case of MPATHing. Reviewed by: zec
# f987f193	22-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Collect all VIMAGE_GLOBALS variables in one place. No longer export rt_tables as all lookups go through rt_tables_get_rnh(). We cannot make rt_tables (and rtstat, rttrash[1]) static as netstat -r (-rs[1]) would stop working on a stripped VIMAGE_GLOBALS kernel. Reviewed by: zec Presumably broken by: phk 13.5y ago in r12820 [1]
# 8896f83a	22-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Add a new function, ifa_ifwithaddr_check(), which rather than returning a pointer to an ifaddr matching the passed socket address, returns a boolean indicating whether one was present. In the (near) future, ifa_ifwithaddr() will return a referenced ifaddr rather than a raw ifaddr pointer, and the new wrapper will allow callers that care only about the boolean condition to avoid having to free that reference. MFC after: 3 weeks
# 1099f828	21-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Clean up common ifaddr management: - Unify reference count and lock initialization in a single function, ifa_init(). - Move tear-down from a macro (IFAFREE) to a function ifa_free(). - Move reference count bump from a macro (IFAREF) to a function ifa_ref(). - Instead of using a u_int protected by a mutex, use refcount(9) for reference count management. The ifa_mtx is now used for exactly one ioctl, and possibly should be removed. MFC after: 3 weeks
# bc29160d	08-Jun-2009	Marko Zec <zec@FreeBSD.org>	Introduce an infrastructure for dismantling vnet instances. Vnet modules and protocol domains may now register destructor functions to clean up and release per-module state. The destructor mechanisms can be triggered by invoking "vimage -d", or a future equivalent command which will be provided via the new jail framework. While this patch introduces numerous placeholder destructor functions, many of those are currently incomplete, thus leaking memory or (even worse) failing to stop all running timers. Many of such issues are already known and will be incrementally fixed over the next weeks in smaller incremental commits. Apart from introducing new fields in structs ifnet, domain, protosw and vnet_net, which requires the kernel and modules to be rebuilt, this change should have no impact on nooptions VIMAGE builds, since vnet destructors can only be called in VIMAGE kernels. Moreover, destructor functions should be in general compiled in only in options VIMAGE builds, except for kernel modules which can be safely kldunloaded at run time. Bump __FreeBSD_version to 800097. Reviewed by: bz, julian Approved by: rwatson, kib (re), julian (mentor)
# c2c2a7c1	01-Jun-2009	Bjoern A. Zeeb <bz@FreeBSD.org>	Convert the two dimensional array to be malloced and introduce an accessor function to get the correct rnh pointer back. Update netstat to get the correct pointer using kvm_read() as well. This not only fixes the ABI problem depending on the kernel option but also permits the tunable to overwrite the kernel option at boot time up to MAXFIBS, enlarging the number of FIBs without having to recompile. So people could just use GENERIC now. Reviewed by: julian, rwatson, zec X-MFC: not possible
# d7fcc528	01-May-2009	Marko Zec <zec@FreeBSD.org>	Unbreak options VIMAGE + nooptions INVARIANTS kernel builds. Submitted by: julian Approved by: julian (mentor)
# 093f25f8	26-Apr-2009	Marko Zec <zec@FreeBSD.org>	In preparation for turning on options VIMAGE in next commits, rearrange / replace / adjust several INIT_VNET_* initializer macros, all of which currently resolve to whitespace. Reviewed by: bz (an older version of the patch) Approved by: julian (mentor)
# 427ac07f	14-Apr-2009	Kip Macy <kmacy@FreeBSD.org>	Extend route command: - add show as alias for get - add weights to allow mpath to do more than equal cost - add sticky / nostick to disable / re-enable per-connection load balancing This adds a field to rt_metrics_lite so network bits of world will need to be re-built. Reviewed by: jeli & qingli
# bfe1aba4	10-Apr-2009	Marko Zec <zec@FreeBSD.org>	Introduce vnet module registration / initialization framework with dependency tracking and ordering enforcement. With this change, per-vnet initialization functions introduced with r190787 are no longer directly called from traditional initialization functions (which cc in most cases inlined to pre-r190787 code), but are instead registered via the vnet framework first, and are invoked only after all prerequisite modules have been initialized. In the long run, this framework should allow us to both initialize and dismantle multiple vnet instances in a correct order. The problem this change aims to solve is how to replay the initialization sequence of various network stack components, which have been traditionally triggered via different mechanisms (SYSINIT, protosw). Note that this initialization sequence was and still can be subtly different depending on whether certain pieces of code have been statically compiled into the kernel, loaded as modules by boot loader, or kldloaded at run time. The approach is simple - we record the initialization sequence established by the traditional mechanisms whenever vnet_mod_register() is called for a particular vnet module. The vnet_mod_register_multi() variant allows a single initializer function to be registered multiple times but with different arguments - currently this is only used in kern/uipc_domain.c by net_add_domain() with different struct domain * as arguments, which allows for protosw-registered initialization routines to be invoked in a correct order by the new vnet initialization framework. For the purpose of identifying vnet modules, each vnet module has to have a unique ID, which is statically assigned in sys/vimage.h. Dynamic assignment of vnet module IDs is not supported yet. A vnet module may specify a single prerequisite module at registration time by filling in the vmi_dependson field of its vnet_modinfo struct with the ID of the module it depends on. 
Unless specified otherwise, all vnet modules depend on VNET_MOD_NET (container for ifnet list head, rt_tables etc.), which thus has to and will always be initialized first. The framework will panic if it detects any unresolved dependencies before completing system initialization. Detection of unresolved dependencies for vnet modules registered after boot (kldloaded modules) is not provided. Note that the fact that each module can specify only a single prerequisite may become problematic in the long run. In particular, INET6 depends on INET being already instantiated, due to TCP / UDP structures residing in INET container. IPSEC also depends on INET, which will in turn additionally complicate making INET6-only kernel configs a reality. The entire registration framework can be compiled out by turning on the VIMAGE_GLOBALS kernel config option. Reviewed by: bz Approved by: julian (mentor)
# 1ed81b73	06-Apr-2009	Marko Zec <zec@FreeBSD.org>	First pass at separating per-vnet initializer functions from existing functions for initializing global state. At this stage, the new per-vnet initializer functions are directly called from the existing global initialization code, which should in most cases result in compiler inlining those new functions, hence yielding a near-zero functional change. Modify the existing initializer functions which are invoked via protosw, like ip_init() et. al., to allow them to be invoked multiple times, i.e. per each vnet. Global state, if any, is initialized only if such functions are called within the context of vnet0, which will be determined via the IS_DEFAULT_VNET(curvnet) check (currently always true). While here, V_irtualize a few remaining global UMA zones used by net/netinet/netipsec networking code. While it is not yet clear to me or anybody else whether this is the right thing to do, at this stage this makes the code more readable, and makes it easier to track uncollected UMA-zone-backed objects on vnet removal. In the long run, it's quite possible that some form of shared use of UMA zone pools among multiple vnets should be considered. Bump __FreeBSD_version due to changes in layout of structs vnet_ipfw, vnet_inet and vnet_net. Approved by: julian (mentor)
# a42ea597	02-Jan-2009	Qing Li <qingli@FreeBSD.org>	The log message should terminate with a newline instead of a tab character.
# 7b4d716b	15-Dec-2008	Kip Macy <kmacy@FreeBSD.org>	style and spelling fix
# 6e6b3f7c	14-Dec-2008	Qing Li <qingli@FreeBSD.org>	The main goals of this project are: 1. separating L2 tables (ARP, NDP) from the L3 routing tables 2. removing as many locking dependencies among these layers as possible to allow for some parallelism in the search operations 3. simplifying the logic in the routing code. The most notable end result is the obsolescence of the route cloning (RTF_CLONING) concept, which translated into code reduction in both IPv4 ARP and IPv6 NDP related modules, and size reduction in struct rtentry{}. The change in design obsoletes the semantics of RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland applications such as "arp" and "ndp" have been modified to reflect those changes. The output from "netstat -r" shows only the routing entries. Quite a few developers have contributed to this project in the past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and Andre Oppermann. And most recently: - Kip Macy revised the locking code completely, thus completing the last piece of the puzzle; Kip has also been conducting active functional testing - Sam Leffler has helped me improve and refactor the code, and provided valuable reviews - Julian Elischer set up the perforce tree for me and has helped me maintain that branch before the svn conversion
# 9b20205d	10-Dec-2008	Kip Macy <kmacy@FreeBSD.org>	fix a reported panic when adding a route and one hit here when deleting a route - pass RTF_RNH_LOCKED to rtalloc1_fib in 2 cases where the lock is held - make sure the rnh lock is held across rt_setgate and rt_getifa_fib
# 4e5fd766	09-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix a bug introduced in r185747: rather than dereferencing an uninitialized *rt to something undefined, use the fibnum that came in as function argument. Found with: Coverity Prevent(tm) CID: 4168
# c96b8224	08-Dec-2008	Kip Macy <kmacy@FreeBSD.org>	- avoid recursively locking the radix node head lock - assert that it is held if RTF_RNH_LOCKED is not passed
# 3120b9d4	07-Dec-2008	Kip Macy <kmacy@FreeBSD.org>	- convert radix node head lock from mutex to rwlock - make radix node head lock not recursive - fix LOR in rtexpunge - fix LOR in rtredirect Reviewed by: sam
# 4b79449e	02-Dec-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Rather than using hidden includes (with circular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
# 97021c24	26-Nov-2008	Marko Zec <zec@FreeBSD.org>	Merge more of currently non-functional (i.e. resolving to whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 44e33a07	19-Nov-2008	Marko Zec <zec@FreeBSD.org>	Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instantiation, assign initial values to them in initializer functions. As a rule, initialization at instantiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentially, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methodology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to initialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 8b615593	02-Oct-2008	Marko Zec <zec@FreeBSD.org>	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparison between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
# 4e7840e2	20-Sep-2008	Marko Zec <zec@FreeBSD.org>	Move #defines for MRT-related constants from net/route.c to net/route.h, because the vnet code will need those constants as well. Reviewed by: bz Approved by: julian (mentor) MFC after: never
# 3ee01afa	15-Sep-2008	Julian Elischer <julian@FreeBSD.org>	Hey, committed the same typo twice! must be a record
# 1d3ab08a	14-Sep-2008	Julian Elischer <julian@FreeBSD.org>	Rewrite rt_check. Take into account that while the rtentry is unlocked, someone else might change it, so after we re-acquire the lock on it, we need to check it is still valid. People have been panicking in this function due to some edge cases which I have hopefully removed. Reviewed by: keramida @ Obtained from: 1 week
# 5e7b481a	14-Sep-2008	Julian Elischer <julian@FreeBSD.org>	come on Julian, make up if you're committing one change or the other. fix braino
# 93fcb5a2	14-Sep-2008	Julian Elischer <julian@FreeBSD.org>	Revert a part of the MRT commit that proved un-needed. rt_check() in its original form proved to be sufficient and rt_check_fib() can go away (as can its evil twin in_rt_check()). I believe this does NOT address the crashes people have been seeing in rt_check. MFC after: 1 week
# c7cacf27	01-Sep-2008	Brooks Davis <brooks@FreeBSD.org>	Wrap a line that became too long with the addition of V_. (This file contains many more unwrapped or badly wrapped lines.)
# 603724d3	17-Aug-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
# 66e8505f	26-Jul-2008	Julian Elischer <julian@FreeBSD.org>	Add the ability to add new addresses for interfaces to just one FIB (Other more specific related options will follow) This allows one to set multiple p2p links to the same place and select which to use by having each in different FIBs.
# 6f95a5eb	09-May-2008	Julian Elischer <julian@FreeBSD.org>	move a #define from a place it shouldn't have been to a place it should have been. Basically my testing didn't cover one case that this broke. thanks tinderbox!
# 9ac73669	09-May-2008	Julian Elischer <julian@FreeBSD.org>	undef MAXFIBS before redefining it
# 8b07e49a	09-May-2008	Julian Elischer <julian@FreeBSD.org>	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). 
The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extend that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X], where Y is always 0 for all protocol families except ipv4. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). 
These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. If nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility called setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. 
A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being responded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect with the process that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routed accordingly. ipfw has grown 2 new keywords: setfib N ip from any to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. 
I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing to the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibility of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the destination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
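The two-dimensional table described in the 8b07e49a message can be sketched roughly as follows. This is a minimal illustration, not the kernel code: every `sketch_`/`SKETCH_` name, the table sizes, and the family index used for IPv4 are assumptions made for the example. It only shows the rule the message states, namely that IPv4 selects row `Y` by FIB number while every other family always reads row 0.

```c
#include <assert.h>
#include <stddef.h>

/* All names and sizes below are hypothetical stand-ins for illustration. */
#define SKETCH_MAXFIBS 16  /* the message mentions a 16-table limit */
#define SKETCH_AF_MAX  4   /* assumed number of protocol families */
#define SKETCH_AF_INET 2   /* assumed index standing in for IPv4 */

struct sketch_rnh { int id; };  /* stand-in for a per-family table head */

/* Row = FIB number, column = protocol family. */
static struct sketch_rnh *sketch_rt_tables[SKETCH_MAXFIBS][SKETCH_AF_MAX];

/*
 * Return the table head for (fib, family).  Only IPv4 honors the FIB
 * number; all other families are forced to row 0, so code unaware of
 * FIBs still sees what looks like the old one-dimensional array.
 */
static struct sketch_rnh *
sketch_rt_table_get(int fib, int family)
{
	assert(fib >= 0 && fib < SKETCH_MAXFIBS);
	assert(family >= 0 && family < SKETCH_AF_MAX);
	if (family != SKETCH_AF_INET)
		fib = 0;
	return sketch_rt_tables[fib][family];
}
```

A caller that knows nothing about FIBs would effectively be calling `sketch_rt_table_get(0, family)`, which is why the legacy entry points can remain ABI compatible.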
# ea9cd9f2	13-Apr-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	Fix the build in case RADIX_MPATH is not defined.
# e440aed9	12-Apr-2008	Qing Li <qingli@FreeBSD.org>	This patch provides the back end support for equal-cost multi-path (ECMP) for both IPv4 and IPv6. Previously, multipath route insertion was disallowed. For example, route add -net 192.103.54.0/24 10.9.44.1 route add -net 192.103.54.0/24 10.9.44.2 The second route insertion will trigger an error message of "add net 192.103.54.0/24: gateway 10.2.5.2: route already in table" Multiple default routes can also be inserted. Here is the netstat output: default 10.2.5.1 UGS 0 3074 bge0 => default 10.2.5.2 UGS 0 0 bge0 When multipath routes exist, the "route delete" command requires a specific gateway to be specified or else an error message would be displayed. For example, route delete default would fail and trigger the following error message: "route: writing to routing socket: No such process" "delete net default: not in table" On the other hand, route delete default 10.2.5.2 would be successful: "delete net default: gateway 10.2.5.2" One does not have to specify a gateway if there is only a single route for a particular destination. I need to perform more testing on address aliases and multiple interfaces that have the same IP prefixes. This patch as it stands today is not yet ready for prime time. Therefore, the ECMP code fragments are fully guarded by the RADIX_MPATH macro. Include the "options RADIX_MPATH" in the kernel configuration to enable this feature. Reviewed by: robert, sam, gnn, julian, kmacy
# 1951e633	13-Feb-2008	John Baldwin <jhb@FreeBSD.org>	Use RTFREE_LOCKED() instead of rtfree() when releasing a reference on the 'rt' route in rtredirect() as 'rt' is always locked. MFC after: 1 week PR: kern/117913 Submitted by: Stefan Lambrev stefan.lambrev of moneybookers.com
# f321ff15	27-Dec-2007	Maxime Henrion <mux@FreeBSD.org>	Add a workaround for a deadlock between the rt_setgate() and rt_check() functions. It is easily triggered by running routed, and, I expect, by running any other daemon that uses routing sockets. Reviewed by: net@ MFC after: 1 week
# 29910a5a	17-Dec-2007	Kip Macy <kmacy@FreeBSD.org>	widen the routing event interface (arp update, redirect, and eventually pmtu change) into separate functions revert previous commit's changes to arpresolve and add a new interface arpresolve2 which does arp resolution without an mbuf
# 8e7e854c	12-Dec-2007	Kip Macy <kmacy@FreeBSD.org>	add interface for allowing consumers to register for ARP updates, redirects, and path MTU changes Reviewed by: silby
# bf3ce91a	06-Dec-2007	Julian Elischer <julian@FreeBSD.org>	No need to assert that a == b when we just set a = b.
# 21b415b2	22-Oct-2007	John Baldwin <jhb@FreeBSD.org>	Close a race when trying to lookup a gateway route in rt_check(). Specifically, if two threads were doing concurrent lookups and the existing gateway was marked down, the first thread would drop a reference on the gateway route and then unlock the "root" route while it tried to allocate a new route. The second thread could then also drop a reference on the same gateway route resulting in a reference underflow. Fix this by clearing the gateway route pointer after dropping the reference count but before dropping the lock. Secondly, in this same case, the second thread would overwrite the gateway route pointer w/o free'ing a reference to the route installed by the first thread. In practice this would probably just fix a lost reference that would result in a route never being freed. This fixes panics observed in rt_check() and rtexpunge(). MFC after: 1 week PR: kern/112490 Insight from: mehuljv at yahoo.com Reviewed by: ru (found the "not-setting it to NULL" part) Tested by: several
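The ordering that the 21b415b2 message describes (drop the reference, clear the pointer, only then release the lock) can be illustrated with a small single-threaded sketch. The structure, the `locked` flag standing in for the route mutex, and all `sketch_` names are hypothetical; real code would use the kernel's locking and reference macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a route entry; 'locked' models the mutex. */
struct sketch_rt {
	int refcnt;
	int locked;
	struct sketch_rt *gwroute;  /* cached gateway route, may be NULL */
};

/*
 * Release a stale gateway route.  The pointer is cleared while the
 * lock is still held, so a second caller that takes the lock later
 * finds NULL and cannot drop the same reference a second time.
 */
static void
sketch_drop_stale_gateway(struct sketch_rt *rt)
{
	assert(rt != NULL && rt->locked);
	if (rt->gwroute != NULL) {
		rt->gwroute->refcnt--;   /* drop our reference */
		rt->gwroute = NULL;      /* clear before unlocking */
	}
	rt->locked = 0;                  /* unlock last */
}
```

If the pointer were left set across the unlock, a second pass through the same path would decrement the count again, which is the underflow the message reports.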
# 335fbc46	10-Jun-2007	Poul-Henning Kamp <phk@FreeBSD.org>	Add missing \n to printf
# a0c0e34b	22-May-2007	Gleb Smirnoff <glebius@FreeBSD.org>	Some minor cleanups: - In rt_check() remove the senderr() macro and the "bad" label. They used to simplify code, but now aren't. - Remove extra RT_LOCK_ASSERT() in rt_setgate(). The RT_REMREF macro does this. - In rtfree() convert panics to KASSERTs. - Strict the routing API: rtfree() should be called only in a case when we are completely sure we've got the last reference on the rtentry. In all other cases RTFREE_LOCKED() macro should be used. If the reference isn't the last one spit out a warning printf. Correct the only(?) case for this in rt_check(). - Fix typos in comments.
# 6f5967c08	22-Nov-2006	Bruce Evans <bde@FreeBSD.org>	Initialize a local variable in 2 places just before it is used, not always at the start of rtalloc1(). This backs out part of revs 1.83 and 1.85. Profiling on an i386 showed that for sending tiny packets using bge, -current takes 7 bzero()s where RELENG_4 takes only 1, and that bzero()ing is now the dominant overhead (10-12%, up from 1%, but profiling overestimated this a bit). This commit backs out 2 of the 6 extra bzero()s (1 in each of 2 calls per packet to rtalloc1()). They were the largest ones by byte count (48 bytes each) but perhaps not by time (small misaligned ones might take longer).
# 1a41f910	05-Jun-2006	Qing Li <qingli@FreeBSD.org>	Assuming the interface has an address of x.x.x.195, a mask of 255.255.255.0, and a default route with gateway x.x.x.1. Now if the address mask is changed to something more specific, e.g., 255.255.255.128, then after the mask change the default gateway is no longer reachable. Since the default route is still present in the routing table, when the output code tries to resolve the address of the default gateway in function rt_check(), again, the default route will be returned by rtalloc1(). Because the lock is currently held on the rtentry structure, one more attempt to hold the lock will trigger a crash due to "lock recursed on non-recursive mutex ..." This is a general problem. The fix checks for the above condition so that an existing route entry is not mistaken for a new cloned route. Appropriately, an ENETUNREACH error is returned back to the caller. Approved by: andre
# e034e82c	16-May-2006	Qing Li <qingli@FreeBSD.org>	The current routing code allows insertion of indirect routes that have gateways which are unreachable except through the default router. For example, assuming there is a default route configured, and inserting a route "route add 64.102.54.0/24 60.80.1.1" is currently allowed even when 60.80.1.1 is only reachable through the default route. However, an error is thrown when this route is utilized, say, "ping 64.102.54.1" will return an error. This type of route insertion should be disallowed because: 1) Let's say that somehow our code allowed this packet to flow to the default router, and the default router knows the next hop is 60.80.1.1, then the question is why bother inserting this route in the 1st place, just simply use the default route. 2) Since we're not talking about source routing here, the default router could very well choose a different path than using 60.80.1.1 for the next hop, again it defeats the purpose of adding this route. Reviewed by: ru, gnn, bz Approved by: andre
# ac4a76eb	04-May-2006	Bjoern A. Zeeb <bz@FreeBSD.org>	In rtrequest and rtinit check for sa_len != 0 for the given destination. These checks are needed so we do not install a route looking like this: (0) 192.0.2.200 UH tun0 => When removing this route the kernel will start to walk the address space which looks like a hang on 64bit platforms because it'll take ages while on 32bit you should see a panic when kernel debugging options are turned on. The problem is in rtrequest1: if (netmask) { rt_maskedcopy(dst, ndst, netmask); } else bcopy(dst, ndst, dst->sa_len); In both cases the len might be 0 if the application forgot to set it. If so ndst will be all-zero leading to above mentioned strange routes. This is an application error but we must not fail/hang/panic because of this. Looks ok: gnn No objections: net@ (silence) MFC after: 8 weeks
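The guard described in ac4a76eb amounts to rejecting a destination whose length field is zero before any copy is made. A minimal sketch, assuming a BSD-style sockaddr with a leading `sa_len` byte: the struct, the function name, and the choice of `EINVAL` as the error are assumptions for illustration, since the message does not say which errno is returned.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical BSD-style sockaddr with a leading length byte. */
struct sketch_sockaddr {
	uint8_t sa_len;
	uint8_t sa_family;
	char    sa_data[14];
};

/*
 * Validate a route destination before it is copied.  A zero sa_len
 * would make the copy produce an all-zero key, which is the bogus
 * route the commit message describes.  EINVAL is an assumed choice.
 */
static int
sketch_check_dst(const struct sketch_sockaddr *dst)
{
	if (dst == NULL || dst->sa_len == 0)
		return EINVAL;
	return 0;
}
```

Checking at the entry points keeps an application mistake (forgetting to set the length) from turning into a kernel hang or panic later, when the bogus route is removed.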
# 4a0d6638	11-Nov-2005	Ruslan Ermilov <ru@FreeBSD.org>	- Store pointer to the link-level address right in "struct ifnet" rather than in ifindex_table[]; all (except one) accesses are through ifp anyway. IF_LLADDR() works faster, and all (except one) ifaddr_byindex() users were converted to use ifp->if_addr. - Stop storing a (pointer to) Ethernet address in "struct arpcom", and drop the IFP2ENADDR() macro; all users have been converted to use IF_LLADDR() instead.
# 2d7e9ead	21-Sep-2005	Gleb Smirnoff <glebius@FreeBSD.org>	Several fixes to rt_setgate(), that fix problems with route changing: - Rearrange code so that in a case of failure the affected route is not changed. Otherwise, a bogus rtentry will be left and later rt_check() can recurse on its lock. [1] - Remove comment about protocol cloning. - Fix two places where rtentry mutex was recursed on, because accessed via two different pointers, that were actually pointing to the same rtentry in some cases. [1] - Return EADDRINUSE instead of bogus EDQUOT, in case when gateway uses the same route. [2] Reported & tested by: ps, Andrej Zverev <az inec.ru> [1] PR: kern/64090 [2]
# fe53256d	19-Sep-2005	Andre Oppermann <andre@FreeBSD.org>	Use monotonic 'time_uptime' instead of 'time_second' as timebase for rt->rt_rmx.rmx_expire.
# 530f95fc	11-Aug-2005	Gleb Smirnoff <glebius@FreeBSD.org>	o Make rt_check() function more strict: - rt0 passed to rt_check() must not be NULL, assert this. - rt returned by rt_check() must be valid locked rtentry, if no error occurred. o Modify callers, so that they never pass NULL rt0 to rt_check(). Reviewed by: sam, ume (nd6.c)
# 9bd8ca30	09-Aug-2005	Gleb Smirnoff <glebius@FreeBSD.org>	In preparation for fixing races in ARP (and probably in other L2/L3 mappings) make rt_check() return a locked rtentry.
# 16a2e0a6	28-Jun-2005	Qing Li <qingli@FreeBSD.org>	Require gateways for routes to be of the same address family as the route itself. It fixes a bug where an IPv4 route for example has an IPv6 gateway specified: route add 10.1.1.1 -inet6 fe80::1%fxp0 Destination Gateway Flags Refs Use Netif Expire 10.1.1.1 fe80::1%fxp0 UGHS 0 0 fxp0 The fix rejects these illegal combinations: route: writing to routing socket: Invalid argument add host 10.1.1.1: gateway fe80::1%fxp0: Invalid argument Reviewed by: KAME jinmei@isl.rdc.toshiba.co.jp Reviewed by: andre (mentor) Approved by: re MFC after: 5
# c398230b	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for license, minor formatting changes
# 5090559b	21-Aug-2004	Christian S.J. Peron <csjp@FreeBSD.org>	When a prison is given the ability to create raw sockets (when the security.jail.allow_raw_sockets sysctl MIB is set to 1) where privileged access to jails is given out, it is possible for prison root to manipulate various network parameters which affect the host environment. This commit plugs a number of security holes associated with the use of raw sockets and prisons. This commit makes the following changes: - Add a comment to rtioctl warning developers that if they add any ioctl commands, they should use super-user checks where necessary, as it is possible for PRISON root to make it this far in execution. - Add super-user checks for the execution of the SIOCGETVIFCNT and SIOCGETSGCNT IP multicast ioctl commands. - Add a super-user check to rip_ctloutput(). If the calling cred is PRISON root, make sure the socket option name is IP_HDRINCL, otherwise deny the request. Although this patch corrects a number of security problems associated with raw sockets and prisons, the warning in jail(8) should still apply, and by default we should keep the default value of security.jail.allow_raw_sockets MIB to 0 (or disabled) until we are certain that we have tracked down all the problems. Looking forward, we will probably want to eliminate the references to curthread. This may be a MFC candidate for RELENG_5. Reviewed by: rwatson Approved by: bmilekic (mentor)
# 2dc1d581	11-Aug-2004	Andre Oppermann <andre@FreeBSD.org>	Convert the routing table to use an UMA zone for rtentries. The zone is called "rtentry". This saves a considerable amount of kernel memory. R_Zmalloc previously used 256 byte blocks (plus kmalloc overhead) whereas UMA only needs 132 bytes. Idea from: OpenBSD
# 445e045b	28-Jul-2004	Alexander Kabaev <kan@FreeBSD.org>	Avoid casts as lvalues.
# 490b9d88	24-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	fix one typo and remove one wrong line
# 76927022	24-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	Correct and extend the description of the behaviour of rt_check().
# d6941ce9	21-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	Clearly comment the assumptions that allow us to cast a 'struct radix_node *' to a 'struct rtentry *' in this code, and introduce a macro, RNTORT(), to do this type conversion.
# 85911824	20-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	Fix the initial check for NULL arguments in rtfree (previously it checked for rt == NULL after dereferencing the pointer). We never check for those events elsewhere, so probably these checks might go away here as well. Slightly simplify (and document) the logic for memory allocation in rt_setgate(). The rest is mostly style changes -- replace 0 with NULL where appropriate, remove the macro SA() that was only used once, remove some useless debugging code in rt_fixchange, explain some odd-looking casts.
# 1838a647	18-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	replace Bcopy with bcopy as in the rest of the file.
# 2eb5613f	17-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	make route_init() static
# 9b98ee2c	16-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	Consistently use ifaddr_byindex() to access the link-level address of an interface. No functional change. On passing, comment a likely bug in net/rtsock.c:sysctl_ifmalist() which, if confirmed, would deserve to be fixed and MFC'ed
# e74642df	13-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	route.h: introduce a macro, SA_SIZE(struct sockaddr *) which returns the space occupied by a struct sockaddr when passed through a routing socket. Use it to replace the macro ROUNDUP(int), that does the same but is redefined by every file which uses it, courtesy of the School of Cut'n'Paste Programming(TM). (partial) userland changes to follow.
# 5aca0b30	12-Apr-2004	Luigi Rizzo <luigi@FreeBSD.org>	in rtinit(), remove one useless variable, and move a few others within the block where they are used.
# f36cfd49	07-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson
# d4b2657f	07-Jan-2004	Sam Leffler <sam@FreeBSD.org>	Remove extraneous unlock. This fixes a panic seen when manipulating static entries in the ARP table.
# e21afc60	07-Dec-2003	Sam Leffler <sam@FreeBSD.org>	bandaid LOR in rt_setgate; a proper fix requires code refactoring
# 72b9c8c9	25-Nov-2003	Sam Leffler <sam@FreeBSD.org>	workaround LOR in rt_setgate Reviewed by: andre Approved by: re (rwatson)
# 26d02ca7	20-Nov-2003	Andre Oppermann <andre@FreeBSD.org>	Remove RTF_PRCLONING from routing table and adjust users of it accordingly. The define is left intact for ABI compatibility with userland. This is a pre-step for the introduction of tcp_hostcache. The network stack remains fully useable with this change. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
# 7138d65c	08-Nov-2003	Sam Leffler <sam@FreeBSD.org>	replace explicit changes to rt_refcnt by RT_ADDREF and RT_REMREF macros that expand to include assertions when the system is built with INVARIANTS Supported by: FreeBSD Foundation
# 9c63e9db	30-Oct-2003	Sam Leffler <sam@FreeBSD.org>	Overhaul routing table entry cleanup by introducing a new rtexpunge routine that takes a locked routing table reference and removes all references to the entry in the various data structures. This eliminates instances of recursive locking and also closes races where the lock on the entry had to be dropped prior to calling rtrequest(RTM_DELETE). This also cleans up confusion where the caller held a reference to an entry that might have been reclaimed (and in some cases used that reference). Supported by: FreeBSD Foundation
# 319de71e	29-Oct-2003	Sam Leffler <sam@FreeBSD.org>	avoid recursive lock panic by unlocking before calling rtrequest; this is consistent with other places but will be replaced shortly by a "proper fix" Supported by: FreeBSD Foundation Pain felt by: Jiri Mikulas
# ea045210	16-Oct-2003	Sam Leffler <sam@FreeBSD.org>	Correct handling of cloning loop avoidance: rtalloc1 may return a null pointer in which case we should not do the unlock. Supported by: FreeBSD Foundation
# 3299a156	10-Oct-2003	Sam Leffler <sam@FreeBSD.org>	fix braino: null the pointer whose memory we just freed, not some other pointers that are (potentially) used later
# 3e6a836e	07-Oct-2003	Sam Leffler <sam@FreeBSD.org>	ensure local variable is initialized prior to use
# 4de5d90c	05-Oct-2003	Sam Leffler <sam@FreeBSD.org>	fix typo that caused a panic when processing an ICMP redirect Sponsored by: FreeBSD Foundation
# d1dd20be	03-Oct-2003	Sam Leffler <sam@FreeBSD.org>	Locking for updates to routing table entries. Each rtentry gets a mutex that covers updates to the contents. Note this is separate from holding a reference and/or locking the routing table itself. Other/related changes: o rtredirect loses the final parameter by which an rtentry reference may be returned; this was never used and added unwarranted complexity for locking. o minor style cleanups to routing code (e.g. ansi-fy function decls) o remove the logic to bump the refcnt on the parent of cloned routes, we assume the parent will remain as long as the clone; doing this avoids a circularity in locking during delete o convert some timeouts to MPSAFE callouts Notes: 1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level applications cannot/do not know about mutexes. Doing this requires that the mutex be the last element in the structure. A better solution is to introduce an externalized version of struct rtentry but this is a major task because of the intertwining of rtentry and other data structures that are visible to user applications. 2. There are known LOR's that are expected to go away with forthcoming work to eliminate many held references. If not these will be resolved prior to release. 3. ATM changes are untested. Sponsored by: FreeBSD Foundation Obtained from: BSD/OS (partly)
# becc44d7	03-Oct-2003	Sam Leffler <sam@FreeBSD.org>	cleanups prior to adding locking (and in some cases to eliminate locking): o move route_cb to be private to rtsock.c o replace global static route_proto by locals o eliminate global #define shorthands for info references o remove some register decls o ansi-fy function decls o move items to be close in scope to their usage o add rt_dispatch function for dispatching the actual message o cleanup tangled logic for doing all-but-me msg send Support by: FreeBSD Foundation
# 983985c1	13-Apr-2003	Jeffrey Hsu <hsu@FreeBSD.org>	No need to unlock if error detected before locking. Submitted by: harti
# 7f760c48	02-Mar-2003	Matthew N. Dodd <mdodd@FreeBSD.org>	Reduce code duplication. This adds the function rt_check() to route.c. Approved by: sam (in principle)
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 94e013f0	25-Dec-2002	Ruslan Ermilov <ru@FreeBSD.org>	I'm not sure what was the problem at the time of revision 1.37 when julian@ added it, but the commented out code had at least one bug -- not freeing the allocated mbuf. Anyway, this comment no longer applies as of revision 1.67, so remove it.
# 42e9e16d	25-Dec-2002	Ruslan Ermilov <ru@FreeBSD.org>	Revision 1.67 changes correspond to CSRG revision 8.3.1.1 changes.
# 71eba915	25-Dec-2002	Ruslan Ermilov <ru@FreeBSD.org>	If the caller of rtrequest*(RTM_DELETE, ...) asked for a copy of the entry being removed (ret_nrt != NULL), increment the entry's rt_refcnt like we do it for RTM_ADD and RTM_RESOLVE, rather than messing around with 1->0 transitions for rtfree() all over.
# 956b0b65	23-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	SMP locking for radix nodes.
# 36fea5de	23-Dec-2002	Ruslan Ermilov <ru@FreeBSD.org>	rn_walktree*() compute the next leaf before applying a function to current leaves because function may vanish the current node. If parent RTA_GENMASK route has a clone (a "cloning clone"), an rn_walktree_from() starting from parent will cause another walk starting from clone. If a function is either rt_fixdelete() or rt_fixchange(), this recursive walk may vanish the leaf that is remembered by an outer walk (the "next leaf" above), panicking a system when it resumes with an outer walk. The following script panicked my single-user mode booted system: : sysctl net.inet.ip.forwarding=1 : ipfw add 1 allow ip from any to any : ifconfig lo0 127.1 : route add -net 10 -genmask 255.255.255.0 127.1 : telnet 10.1 # rt_fixchange() panic : telnet 10.2 : telnet 10.1 : route delete -net 10 # rt_fixdelete() panic For the time being, avoid these races by disallowing recursive walks in rt_fixchange() and rt_fixdelete(). Also, make a slight optimization in the rtrequest(RTM_RESOLVE) case: there is no reason to call rt_fixchange() in this case. PR: kern/37606 MFC after: 5 days
# 19fc74fb	18-Dec-2002	Jeffrey Hsu <hsu@FreeBSD.org>	Lock up ifaddr reference counts.
# bbb4330b	15-Nov-2002	Luigi Rizzo <luigi@FreeBSD.org>	Massive cleanup of the ip_mroute code. No functional changes, but: + the mrouting module now should behave the same as the compiled-in version (it did not before, some of the rsvp code was not loaded properly); + netinet/ip_mroute.c is now truly optional; + removed some redundant/unused code; + changed many instances of '0' to NULL and INADDR_ANY as appropriate; + removed several static variables to make the code more SMP-friendly; + fixed some minor bugs in the mrouting code (mostly, incorrect return values from functions). This commit is also a prerequisite to the addition of support for PIM, which i would like to put in before DP2 (it does not change any of the existing APIs, anyways). Note, in the process we found out that some device drivers fail to properly handle changes in IFF_ALLMULTI, leading to interesting behaviour when a multicast router is started. This bug is not corrected by this commit, and will be fixed with a separate commit. Detailed changes: -------------------- netinet/ip_mroute.c all the above. conf/files make ip_mroute.c optional net/route.c fix mrt_ioctl hook netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here together with other rsvp code, and a couple of indentation fixes. netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks netinet/ip_var.h rsvp function hooks netinet/raw_ip.c hooks for mrouting and rsvp functions, plus interface cleanup. netinet/ip_mroute.h remove an unused and optional field from a struct Most of the code is from Pavlin Radoslavov and the XORP project Reviewed by: sam MFC after: 1 week
# 54e84abb	30-May-2002	Mike Silbersack <silby@FreeBSD.org>	Ensure that packet counts are always reset to 0 when a route is cloned. Previously, they took on the count of their parent route (which was sometimes nonzero.) Submitted by: Andre Oppermann <oppermann@pipeline.ch> MFC after: 5 days
# 929ddbbb	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# 6f99b44c	28-Nov-2001	Brian Somers <brian@FreeBSD.org>	Fix a typo in a comment
# 8071913d	17-Oct-2001	Ruslan Ermilov <ru@FreeBSD.org>	Pull post-4.4BSD change to sys/net/route.c from BSD/OS 4.2. Have sys/net/route.c:rtrequest1(), which takes ``rt_addrinfo *'' as the argument. Pass rt_addrinfo all the way down to rtrequest1 and ifa->ifa_rtrequest. 3rd argument of ifa->ifa_rtrequest is now ``rt_addrinfo *'' instead of ``sockaddr *'' (almost no one is using it anyways). Benefit: the following command now works. Previously we needed two route(8) invocations, "add" then "change". # route add -inet6 default ::1 -ifp gif0 Remove unsafe typecast in rtrequest(), from ``rtentry *'' to ``sockaddr *''. It was introduced by 4.3BSD-Reno and never corrected. Obtained from: BSD/OS, NetBSD MFC after: 1 month PR: kern/28360
# 4862bf8c	17-Oct-2001	Ruslan Ermilov <ru@FreeBSD.org>	64-bit fixes from CSRG.
# 66953138	15-Oct-2001	Ruslan Ermilov <ru@FreeBSD.org>	Don't even attempt to clone host routes. MFC after: 1 week
# c3cb7e5d	25-Jul-2001	Bill Fenner <fenner@FreeBSD.org>	Don't bother passing p to rtioctl just so it can fail to pass it to mrt_ioctl
# 9a701516	25-Jul-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	As commented in sys/net/route.c, rt_fixchange() has a bad effect, which would cause unnecessary route deletion: "Unfortunately, this has the obnoxious property of also triggering for insertion /above/ a pre-existing network route and clones. Sigh. This may be fixed some day." The effect has been even worse, because recent versions of route.c set the parent rtentry for cloned routes from an interface-direct route. For example, suppose that we have an interface "ne0" that has an IPv4 subnet "10.0.0.0/24". Then we may have a cloned route like 10.0.0.1 on the interface, whose parent route is 10.0.0.0/24 (to the interface ne0). Now, when we add the default route (i.e. 0.0.0.0/0), rt_fixchange() will remove the cloned route 10.0.0.1. The (bad) effect also prevents rt_setgate from configuring rt_gwroute, which would not be an intended behavior. As suggested in the comments to rt_fixchange(), we need a stricter check in the function, to prevent unintentional route deletion. This fix also solves the "IPV6 panic?" problem in nd6_timer(). Submitted by: JINMEI Tatuya <jinmei@isl.rdc.toshiba.co.jp> MFC after: 4 days
# ffdc316d	04-Jun-2001	Ruslan Ermilov <ru@FreeBSD.org>	When looking for an interface appropriate for the (new or changing) route in ifa_ifwithroute(), as the last resort, look up the route to the gateway, not destination (to derive the interface from). PR: kern/27852 Submitted by: Iasen Kostoff <tbyte@tbyte.org> MFC after: 2 weeks
# 089cdfad	15-Mar-2001	Ruslan Ermilov <ru@FreeBSD.org>	net/route.c: A route generated from an RTF_CLONING route had the RTF_WASCLONED flag set but did not have a reference to the parent route, as documented in the rtentry(9) manpage. This prevented such routes from being deleted when their parent route is deleted. Now, for example, if you delete an IP address from a network interface, all ARP entries that were cloned from this interface route are flushed. This also has an impact on netstat(1) output. Previously, dynamically created ARP cache entries (RTF_STATIC flag is unset) were displayed as part of the routing table display (-r). Now, they are only printed if the -a option is given. netinet/in.c, netinet/in_rmx.c: When address is removed from an interface, also delete all routes that point to this interface and address. Previously, for example, if you changed the address on an interface, outgoing IP datagrams might still use the old address. The only solution was to delete and re-add some routes. (The problem is easily observed with the route(8) command.) Note, that if the socket was already bound to the local address before this address is removed, new datagrams generated from this socket will still be sent from the old address. PR: kern/20785, kern/21914 Reviewed by: wollman (the idea)
# 1a11e63e	22-Apr-2000	Garrett Wollman <wollman@FreeBSD.org>	A couple months ago, Kirk and I were doing a walkthrough of the radix-tree search routine, and scratching our heads over why it was so obfuscated. This delta fixes a number of confusing style bugs and renames several structure members to have more meaningful names. There remain a number of odd control-flow structures. These changes do not affect the generated code.
# 66810dd0	15-Jan-2000	Yoshinobu Inoue <shin@FreeBSD.org>	Clear ro->ro_rt just after RTFREE(). Please let me make sure that no one touches the invalid ro_rt pointer, after splx(s) and before next ro_rt initialization. Though usually this seems to be already called at splnet, I still sometimes experience kernel crash at rtfree() in my INET6 enabled environment where IPv6 connection is frequently used. (Of course, it might be just due to another bug.)
# 6a800098	22-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 68f956b8	09-Dec-1999	John Polstra <jdp@FreeBSD.org>	Fix a route table leak in rtalloc() and rtalloc_ign(). It is possible for ro->ro_rt to be non-NULL even though the RTF_UP flag is cleared. (Example: a routing daemon or the "route" command deletes a cloned route in active use by a TCP connection.) In that case, the code was clobbering a reference to the routing table entry without decrementing the entry's reference count. The splnet() call probably isn't needed, but I haven't been able to prove that yet. It isn't significant from a performance standpoint since it is executed very rarely. Reviewed by: wollman and others in the freebsd-current mailing list
# ae5bcbff	09-Dec-1999	Yoshinobu Inoue <shin@FreeBSD.org>	rtcalloc() is removed because it turned out not to be necessary for FreeBSD. (It was added as a part of KAME patch) Specified by: jdp@polstra.com
# a86ab817	23-Nov-1999	Brian Somers <brian@FreeBSD.org>	Only emit the ``wrong ifa'' message if the matching interface is neither IFF_LOOPBACK nor IFF_POINTOPOINT. It's quite common (and probably more correct) to route local IP numbers via lo0 and it makes configuration easier to assign the hostname address to local POINTOPOINT links too. This message usually remains hidden because the loopback interface gets the highest interface number at boot time, but when the ethernet interface is added later, the message can get pretty annoying. Also, fix a typo. Not objected to by: freebsd-net
# 82cd038d	21-Nov-1999	Yoshinobu Inoue <shin@FreeBSD.org>	KAME netinet6 basic part (no IPsec, no V6 Multicast Forwarding, no UDP/TCP for IPv6 yet). With this patch, you can assign an IPv6 addr automatically, and can reply to IPv6 ping. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# cb64988f	28-Apr-1999	Luoqi Chen <luoqi@FreeBSD.org>	Postpone route_init() until all domains are attached.
# 831a80b0	27-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# dc733423	17-Apr-1998	Dag-Erling Smørgrav <des@FreeBSD.org>	Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.
# 303b270b	08-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Staticize.
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# 1d5e9e22	08-Jan-1998	Eivind Eklund <eivind@FreeBSD.org>	Make INET a proper option. This will not change any of the object files that LINT creates; there might be differences with INET disabled, but hardly anything compiled before without INET anyway. Now the 'obvious' things will give a proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The only thing that _should_ work (but can't be made to compile reasonably easily) is sppp :-( This commit moves struct arpcom from <netinet/if_ether.h> to <net/if_arp.h>.
# 55b211e3	28-Oct-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# 514ede09	16-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Fixed gratuitous ANSIisms.
# 4d1d4912	01-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Added used #include - don't depend on <sys/mbuf.h> including <sys/malloc.h> (unless we only use the bogusly shared M*WAIT flags).
# fce002fd	24-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include it when it is not used. In most cases, the reasons for including it went away when the special ioctl headers became self-sufficient.
# 499676df	05-Mar-1997	Julian Elischer <julian@FreeBSD.org>	add a bunch of comments to describe what's going on. This is some of the worst code I've had to wade through in ages and I don't want to have to start from scratch again next time. (I have a 2.2 version of these comments, can I commit them?)
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# d57d661c	25-Jan-1997	Julian Elischer <julian@FreeBSD.org>	fix misleading comment (my error.. I wrote the comment)
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# b0a76b88	10-Sep-1996	Julian Elischer <julian@FreeBSD.org>	No code changes whatsoever, but added about 150 lines of comments. Sorry if this makes it harder to merge in lite2 stuff but hey.. At least I can figure out what is going on whenever I end up going through those files again.. do we have a policy regarding commenting existing code?
# 704b0666	01-Sep-1996	Bill Fenner <fenner@FreeBSD.org>	Bugfix and simplification for rev 1.34: make sure that the route is non-null before trying to delete it in rt_setgate(), which then allows removal of the special-case code from the RTM_ADD case. This should fix the panics that joerg and Phil Karn have been seeing.
# 3271a3a4	23-Aug-1996	Peter Wemm <peter@FreeBSD.org>	route.c:RTM_ADD does not check for a netmask before doing a tree walk like it does elsewhere. This probably only happens when incorrect args are given to route(8), or when running with non-IPv4 stacks, but incorrect args to the route command are no excuse for panicking! Submitted by: Michael Clay <mclay@weareb.org>, PR#1532
# 1db1fffa	09-Jul-1996	Bill Fenner <fenner@FreeBSD.org>	Disallow host routes that point to themselves. These routes serve no purpose, other than to get in the way of the ARP table and cause "can't allocate llinfo" errors. This change may cause gated or routed to start complaining when adding such routes. If so, these programs will need to be fixed to not try to add these routes. Reviewed by: wollman
# 6ac3b69d	29-Mar-1996	Bill Fenner <fenner@FreeBSD.org>	Eliminate panic("rtfree") caused by double-freeing the route when rt == rt->rt_gwroute . rt == rt->gwroute shouldn't happen in the first place, but that's another problem. (try "route add -host <hostonmynet> <hostonmynet>; ping <hostonmynet>; route delete <hostonmynet>")
# 2ee45d7d	11-Mar-1996	David Greenman <dg@FreeBSD.org>	Move or add #include <queue.h> in preparation for upcoming struct socket changes.
# 4bd49128	02-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Add more options into the conf/options and i386/conf/options.i386 files and the #include hooks so that 'make depend' is more useful. This covers most of the options I regularly use (but not all) and some other easy ones.
# fde327d6	24-Jan-1996	Garrett Wollman <wollman@FreeBSD.org>	Fix memory leak in case of adding a host route on top of another one. Pointed-out-by: Bill Fenner <fenner@parc.xerox.com>
# f708ef1b	14-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Another mega commit to staticize things.
# af32e59f	02-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Fixed call to mrt_ioctl(). mrt_ioctl() for some reason has different number of args when MROUTING is defined.
# a98ca469	29-Oct-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Second batch of cleanup changes. This time mostly making a lot of things static and some unused variables here and there.
# aca1a47c	16-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	When adding a route fails because there is already a route with the same (mask,value) in the tree, don't immediately return EEXIST. Instead, check to see if the pre-existing route was generated by protocol-cloning. If so, then it is OK to simply blow away the old route and re-attempt the insertion. If not, then fall back to the same error code as before.
# 9e52b982	03-Oct-1995	Garrett Wollman <wollman@FreeBSD.org>	Import of 4.4-Lite-2 sys/net to make merge and examination easier. Since we are not on the vendor branch for any of these files, the conflicts shown make no matter. Obtained from: 4.4BSD-Lite-2
# 28f8db14	29-Jul-1995	Bruce Evans <bde@FreeBSD.org>	Eliminate sloppy common-style declarations. There should be none left for the LINT configuration.
# 8e718bb4	10-Jul-1995	Garrett Wollman <wollman@FreeBSD.org>	When adding a route, set rt_ifa and rt_ifp a little earlier so that the protocol-specific add routine can examine it if desired.
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# cd02a0b7	25-Apr-1995	Garrett Wollman <wollman@FreeBSD.org>	Finally finish the cloning cleanup work by making sure that clones go away whenever a clone's parent is changed, or a route is added in a certain set of circumstances. This also includes code to forbid setting a route's gateway to an address which can only be reached through that route, thus (hopefully) eliminating one class of cloning bottomless-recursion bugs.
# a29ae2a1	24-Mar-1995	Garrett Wollman <wollman@FreeBSD.org>	Don't delete clones if they are PINNED.
# 3545b048	23-Mar-1995	Garrett Wollman <wollman@FreeBSD.org>	radix.c: correct exit condition in rn_walktree_from() route.c: be a little more careful when deleting children of dying routes
# 771edb14	21-Mar-1995	Garrett Wollman <wollman@FreeBSD.org>	Protocol-cloned routes should gain a reference to their parents to make sure that rt->rt_parent values can never be re-used harmfully.
# 3682d2ba	20-Mar-1995	David Greenman <dg@FreeBSD.org>	Made minor readability tweak.
# c2bed6a3	20-Mar-1995	Garrett Wollman <wollman@FreeBSD.org>	Better fix for the deletion of parents of cloned routes problem, superseding the `nextchild' hack. This also provides a way forward to fix RTM_CHANGE and RTM_ADD as well.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 3ec66d6c	23-Jan-1995	David Greenman <dg@FreeBSD.org>	Added back the missing last few bytes of the file.
# 18e1f1f1	22-Jan-1995	Garrett Wollman <wollman@FreeBSD.org>	route.c: keep track of where cloned routes come from, and make sure to delete them when the ``parent'' goes away route.h: add glue to track this to rtentry structure. WARNING WILL ROBINSON! This will be yet another incompatible change in your route-using binaries. I apologize, but this was the only way to do it. I took this opportunity to increase the size of the metrics to what I believe will be the final length for 2.1, so that when the T/TCP stuff is done, this won't happen again.
# 652082e6	13-Dec-1994	Garrett Wollman <wollman@FreeBSD.org>	Implemented rtalloc_ign().
# 995add1a	13-Dec-1994	Garrett Wollman <wollman@FreeBSD.org>	Add support for two separate cloning flags, one set by the lower layers, and one set by the protocol family. Also add another parameter to rtalloc1() to allow for any interface flags to be ignored; currently this is only useful for RTF_PRCLONING. Get rid of rt_prflags and re-unite with rt_flags. Add T/TCP ``route metrics''. NB: YOU MUST RECOMPILE `route' AND OTHER RELATED PROGRAMS AS A RESULT OF THIS CHANGE. This also adds a new interface parameter, `ifi_physical', which will eventually replace IFF_ALTPHYS as the mechanism for specifying the particular physical connection desired on a multiple-connection card. NB: YOU MUST RECOMPILE `ifconfig' AND OTHER RELATED PROGRAMS AS A RESULT OF THIS CHANGE.
# f084e014	02-Nov-1994	Garrett Wollman <wollman@FreeBSD.org>	Collapse two fields so that we have space for another 32 flags. NB: You will have to recompile programs which use the `rt_use' member in order to get the correct values. This should not cause incorrect operation, but the statistics may look a little confusing.
# 5c2dae8e	01-Nov-1994	Garrett Wollman <wollman@FreeBSD.org>	Add code to be a bit smarter about IP routes, conditioned on the option IN_RMX. (Eventually this will be standard, but I just wrote the code today and don't want to break anyone.)
# 5df72964	11-Oct-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix a bug which caused panics when attempting to change just the flags of a route. (This still doesn't work, but it doesn't panic now.) It looks like there may be a number of incipient bugs in this code. Also, get ready for the time when all IP gateway routes are cloning, which is necessary to keep proper TCP statistics.
# 623ae52e	02-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	GCC cleanup.
# 5e9ae478	13-Sep-1994	Garrett Wollman <wollman@FreeBSD.org>	Shuffle some functions and variables around to make it possible for multicast routing to be implemented as an LKM. (There's still a bit of work to do in this area.)
# 545ce3ae	07-Sep-1994	Garrett Wollman <wollman@FreeBSD.org>	The mrt_ioctl goop properly depends on MROUTING, not MULTICAST. (Oof!)
# e4ca4481	07-Sep-1994	Stefan Eßer <se@FreeBSD.org>	Reviewed by: Stefan Esser Submitted by: rtioctl(): changed parameter to mrt_ioctl from "cmd" to "req" to make it compile with MULTICAST defined.
# f0068c4a	06-Sep-1994	Garrett Wollman <wollman@FreeBSD.org>	Initial get-the-easy-case-working upgrade of the multicast code to something more recent than the ancient 1.2 release contained in 4.4. This code has the following advantages as compared to previous versions (culled from the README file for the SunOS release): - True multicast delivery - Configurable rate-limiting of forwarded multicast traffic on each physical interface or tunnel, using a token-bucket limiter. - Simplistic classification of packets for prioritized dropping. - Administrative scoping of multicast address ranges. - Faster detection of hosts leaving groups. - Support for multicast traceroute (code not yet available). - Support for RSVP, the Resource Reservation Protocol. What still needs to be done: - The multicast forwarder needs testing. - The multicast routing daemon needs to be ported. - Network interface drivers need to have the `#ifdef MULTICAST' goop ripped out of them. - The IGMP code should probably be bogon-tested. Some notes about the porting process: In some cases, the Berkeley people decided to incorporate functionality from later releases of the multicast code, but then had to do things differently. As a result, if you look at Deering's patches, and then look at our code, it is not always obvious whether the patch even applies. Let the reader beware. I ran ip_mroute.c through several passes of `unifdef' to get rid of useless grot, and to permanently enable the RSVP support, which we will include as standard. Ported by: Garrett Wollman Submitted by: Steve Deering and Ajit Thyagarajan (among others)
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources