#
285830 |
|
23-Jul-2015 |
gjb |
- Copy stable/10@285827 to releng/10.2 in preparation for 10.2-RC1 builds. - Update newvers.sh to reflect RC1. - Update __FreeBSD_version to reflect 10.2. - Update default pkg(8) configuration to use the quarterly branch.[1]
Discussed with: re, portmgr [1] Approved by: re (implicit) Sponsored by: The FreeBSD Foundation |
#
256281 |
|
10-Oct-2013 |
gjb |
Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
|
#
241913 |
|
22-Oct-2012 |
glebius |
Switch the entire IPv4 stack to keep the IP packet header in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet.
After this change a packet processed by the stack isn't modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack.
[1] One exception still remains. The raw sockets convert host byte order before pass a packet to an application. Probably this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len, but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
|
#
241575 |
|
15-Oct-2012 |
glebius |
We don't need to convert ip6_len to host byte order before ip6_output(), the IPv6 stack is working in net byte order.
The reason this code worked before is that ip6_output() doesn't look at ip6_plen at all and recalculates it based on mbuf length.
|
#
241394 |
|
10-Oct-2012 |
kevlo |
Revert previous commit...
Pointyhat to: kevlo (myself)
|
#
241370 |
|
09-Oct-2012 |
kevlo |
Prefer NULL over 0 for pointers
|
#
241344 |
|
08-Oct-2012 |
glebius |
After r241245 it appeared that in_delayed_cksum(), which still expects host byte order, was sometimes called with net byte order. Since we are moving towards net byte order throughout the stack, the function was converted to expect net byte order, and its consumers fixed appropriately: - ip_output(), ipfilter(4) not changed, since already call in_delayed_cksum() with header in net byte order. - divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order there and back. - mrouting code and IPv6 ipsec now need to switch byte order there and back, but I hope, this is temporary solution. - In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum(). - pf_route() catches up on r241245 changes to ip_output().
|
#
241342 |
|
08-Oct-2012 |
glebius |
No reason to play with IP header before calling sctp_delayed_cksum() with offset beyond the IP header.
|
#
230452 |
|
22-Jan-2012 |
bz |
Make #error messages string-literals and remove punctuation.
Reported by: bde (for ip_divert) Reviewed by: bde MFC after: 3 days
|
#
230443 |
|
22-Jan-2012 |
bz |
Fix ip_divert handling of inet and inet6 and module building some more.
Properly sort the "carp" case in modules/Makefile after it was renamed.
Reported by: bde (most) Reviewed by: bde MFC after: 3 days
|
#
227309 |
|
07-Nov-2011 |
ed |
Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
|
#
224575 |
|
01-Aug-2011 |
glebius |
Add missing break; in r223593.
Submitted by: sem Pointy hat to: glebius Approved by: re (kib)
|
#
223593 |
|
27-Jun-2011 |
glebius |
Add possibility to pass IPv6 packets to a divert(4) socket.
Submitted by: sem
|
#
222748 |
|
06-Jun-2011 |
rwatson |
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture.
Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x.
Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
#
222690 |
|
04-Jun-2011 |
rwatson |
IP divert sockets use their inpcbinfo for port reservation, although not for lookup. I missed its call to in_pcbbind() when preparing previous patches, which would lead to a lock assertion failure (although problem not an actual race condition due to global pcbinfo locks providing required synchronisation -- in this particular case only). This change adds the missing locking of the pcbhash lock.
(Existing comments in the ipdivert code question the need for using the global hash to manage the namespace, as really it's a simple port namespace and not an address/port namespace. Also, although in_pcbbind is used to manage reservations, the hash tables aren't used for lookup. It might be a good idea to make them use hashed lookup, or to use a different reservation scheme.)
Reviewed by: bz Reported by: Kristof Provost <kristof at sigsegv.be> Sponsored by: Juniper Networks
|
#
222488 |
|
30-May-2011 |
rwatson |
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required.
A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary.
Reviewed by: bz Sponsored by: Juniper Networks, Inc.
|
#
217554 |
|
18-Jan-2011 |
mdf |
Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need to rely on the format string. For SYSCTL_PROC instances that I noticed a discrepancy between the CTLTYPE and the format specifier, fix the CTLTYPE.
|
#
215701 |
|
22-Nov-2010 |
dim |
After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------ r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined.
------------------------------------------------------------------------ r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
------------------------------------------------------------------------ r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
|
#
215317 |
|
14-Nov-2010 |
dim |
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
|
#
211433 |
|
17-Aug-2010 |
jhb |
Ensure a minimum "slop" of 10 extra pcb structures when providing a memory size estimate to userland for pcb list sysctls. The previous behavior of a "slop" of n/8 does not work well for small values of n (e.g. no slop at all if you have less than 8 open UDP connections).
Reviewed by: bz MFC after: 1 week
|
#
205251 |
|
17-Mar-2010 |
bz |
Add pcb reference counting to the pcblist sysctl handler functions to ensure type stability while caching the pcb pointers for the copyout.
Reviewed by: rwatson MFC after: 7 days
|
#
205157 |
|
14-Mar-2010 |
rwatson |
Abstract out initialization of most aspects of struct inpcbinfo from their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy() to do this work in a central spot. As inpcbinfo becomes more complex due to ongoing work to add connection groups, this will reduce code duplication.
MFC after: 1 month Reviewed by: bz Sponsored by: Juniper Networks
|
#
205104 |
|
12-Mar-2010 |
rrs |
The proper fix for the delayed SCTP checksum is to have the delayed function take an argument as to the offset to the SCTP header. This allows it to work for V4 and V6. This of course means changing all callers of the function to either pass the header len, if they have it, or create it (ip_hl << 2 or sizeof(ip6_hdr)). PR: 144529 MFC after: 2 weeks
|
#
204810 |
|
06-Mar-2010 |
rwatson |
Remove unnecessary locking of divcbinfo lock from div_output(): this has not been required since FreeBSD 7.0 when the so_pcb pointer leading to inp was guaranteed to be stable when a valid socket reference is held (as it is in the output path).
MFC after: 1 week Reviewed by: bz Sponsored by: Juniper Networks
|
#
201735 |
|
07-Jan-2010 |
luigi |
Following up on a request from Ermal Luci to make ip_divert work as a client of pf(4), make ip_divert not depend on ipfw.
This is achieved by moving to ip_var.h the struct ipfw_rule_ref (which is part of the mtag for all reinjected packets) and other declarations of global variables, and moving to raw_ip.c global variables for filter and divert hooks.
Note that names and locations could be made more generic (ipfw_rule_ref is really a generic reference robust to reconfigurations; the packet filter is not necessarily ipfw; filters and their clients are not necessarily limited to ipv4), but _right now_ most of this stuff works on ipfw and ipv4, so i don't feel like doing a gratuitous renaming, at least for the time being.
|
#
201527 |
|
04-Jan-2010 |
luigi |
Various cleanup done in ipfw3-head branch including: - use a uniform mtag format for all packets that exit and re-enter the firewall in the middle of a rulechain. On reentry, all tags containing reinject info are renamed to MTAG_IPFW_RULE so the processing is simpler.
- make ipfw and dummynet use ip_len and ip_off in network format everywhere. Conversion is done only once instead of tracking the format in every place.
- use a macro FREE_PKT to dispose of mbufs. This eases portability.
On passing i also removed a few typos, staticise or localise variables, remove useless declarations and other minor things.
Overall the code shrinks a bit and is hopefully more readable.
I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr. For ng_ipfw i am actually waiting for feedback from glebius@ because we might have some small changes to make. For if_bridge and if_ethersubr feedback would be welcome (there are still some redundant parts in these two modules that I would like to remove, but first i need to check functionality).
|
#
200580 |
|
15-Dec-2009 |
luigi |
Start splitting ip_fw2.c and ip_fw.h into smaller components. At this time we pull out from ip_fw2.c the logging functions, and support for dynamic rules, and move kernel-only stuff into netinet/ipfw/ip_fw_private.h
No ABI change involved in this commit, unless I made some mistake. ip_fw.h has changed, though not in the userland-visible part.
Files touched by this commit:
conf/files now references the two new source files
netinet/ip_fw.h remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h.
netinet/ipfw/ip_fw_private.h new file with kernel-specific ipfw definitions
netinet/ipfw/ip_fw_log.c ipfw_log and related functions
netinet/ipfw/ip_fw_dynamic.c code related to dynamic rules
netinet/ipfw/ip_fw2.c removed the pieces that goes in the new files
netinet/ipfw/ip_fw_nat.c minor rearrangement to remove LOOKUP_NAT from the main headers. This require a new function pointer.
A bunch of other kernel files that included netinet/ip_fw.h now require netinet/ipfw/ip_fw_private.h as well. Not 100% sure i caught all of them.
MFC after: 1 month
|
#
196502 |
|
24-Aug-2009 |
zec |
Introduce a div_destroy() function which takes over per-vnet cleanup tasks from the existing modevent / MOD_UNLOAD handler, and register div_destroy() in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds. In nooptions VIMAGE builds, div_destroy() will be invoked from the modevent handler, resulting in effectively identical operation as it was prior this change. div_destroy() also tears down hashtables used by ipdivert, which were previously left behind on ipdivert kldunloads.
For options VIMAGE builds only, temporarily disable kldunloading of ipdivert, because without introducing additional locking logic it is impossible to atomically check whether all ipdivert instances in all vnets are idle, and proceed with cleanup without opening a race window for a vnet to open an ipdivert socket while ipdivert tear-down is in progress.
While here, staticize div_init(), because it is not used outside of ip_divert.c.
In cooperation with: julian Approved by: re (rwatson), julian (mentor) MFC after: 3 days
|
#
196039 |
|
02-Aug-2009 |
rwatson |
Many network stack subsystems use a single global data structure to hold all pertinent statatistics for the subsystem. These structures are sometimes "borrowed" by kernel modules that require a place to store statistics for similar events.
Add KPI accessor functions for statistics structures referenced by kernel modules so that they no longer encode certain specifics of how the data structures are named and stored. This change is intended to make it easier to move to per-CPU network stats following 8.0-RELEASE.
The following modules are affected by this change:
if_bridge if_cxgb if_gif ip_mroute ipdivert pf
In practice, most of these statistics consumers should, in fact, maintain their own statistics data structures rather than borrowing structures from the base network stack. However, that change is too agressive for this point in the release cycle.
Reviewed by: bz Approved by: re (kib)
|
#
196019 |
|
01-Aug-2009 |
rwatson |
Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes.
Reviewed by: bz Approved by: re (vimage blanket)
|
#
195727 |
|
16-Jul-2009 |
rwatson |
Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references.
Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)
|
#
195699 |
|
14-Jul-2009 |
rwatson |
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing the in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)
|
#
195023 |
|
26-Jun-2009 |
rwatson |
Update various IPFW-related modules to use if_addr_rlock()/ if_addr_runlock() rather than IF_ADDR_LOCK()/IF_ADDR_UNLOCK().
MFC after: 6 weeks
|
#
194760 |
|
23-Jun-2009 |
rwatson |
Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references:
ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr
Remove unused macro which didn't have required referencing:
IFP_TO_IA6
This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references.
Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed.
Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)
|
#
193511 |
|
05-Jun-2009 |
rwatson |
Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
|
#
193332 |
|
02-Jun-2009 |
rwatson |
Add internal 'mac_policy_count' counter to the MAC Framework, which is a count of the number of registered policies.
Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero.
This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies.
Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls.
Obtained from: TrustedBSD Project
|
#
193219 |
|
01-Jun-2009 |
rwatson |
Reimplement the netisr framework in order to support parallel netisr threads:
- Support up to one netisr thread per CPU, each processings its own workstream, or set of per-protocol queues. Threads may be bound to specific CPUs, or allowed to migrate, based on a global policy.
In the future it would be desirable to support topology-centric policies, such as "one netisr per package".
- Allow each protocol to advertise an ordering policy, which can currently be one of:
NETISR_POLICY_SOURCE: packets must maintain ordering with respect to an implicit or explicit source (such as an interface or socket).
NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work, as well as allowing protocols to provide a flow generation function for mbufs without flow identifers (m2flow). Falls back on NETISR_POLICY_SOURCE if now flow ID is available.
NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for each packet handled by netisr (m2cpuid).
- Provide utility functions for querying the number of workstreams being used, as well as a mapping function from workstream to CPU ID, which protocols may use in work placement decisions.
- Add explicit interfaces to get and set per-protocol queue limits, and get and clear drop counters, which query data or apply changes across all workstreams.
- Add a more extensible netisr registration interface, in which protocols declare 'struct netisr_handler' structures for each registered NETISR_ type. These include name, handler function, optional mbuf to flow ID function, optional mbuf to CPU ID function, queue limit, and ordering policy. Padding is present to allow these to be expanded in the future. If no queue limit is declared, then a default is used.
- Queue limits are now per-workstream, and raised from the previous IFQ_MAXLEN default of 50 to 256.
- All protocols are updated to use the new registration interface, and with the exception of netnatm, default queue limits. Most protocols register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use NETISR_POLICY_FLOW, and will therefore take advantage of driver- generated flow IDs if present.
- Formalize a non-packet based interface between interface polling and the netisr, rather than having polling pretend to be two protocols. Provide two explicit hooks in the netisr worker for start and end events for runs: netisr_poll() and netisr_pollmore(), as well as a function, netisr_sched_poll(), to allow the polling code to schedule netisr execution. DEVICE_POLLING still embeds single-netisr assumptions in its implementation, so for now if it is compiled into the kernel, a single and un-bound netisr thread is enforced regardless of tunable configuration.
In the default configuration, the new netisr implementation maintains the same basic assumptions as the previous implementation: a single, un-bound worker thread processes all deferred work, and direct dispatch is enabled by default wherever possible.
Performance measurement shows a marginal performance improvement over the old implementation due to the use of batched dequeue.
An rmlock is used to synchronize use and registration/unregistration using the framework; currently, synchronized use is disabled (replicating current netisr policy) due to a measurable 3%-6% hit in ping-pong micro-benchmarking. It will be enabled once further rmlock optimization has taken place. However, in practice, netisrs are rarely registered or unregistered at runtime.
A new man page for netisr will follow, but since one doesn't currently exist, it hasn't been updated.
This change is not appropriate for MFC, although the polling shutdown handler should be merged to 7-STABLE.
Bump __FreeBSD_version.
Reviewed by: bz
|
#
191688 |
|
30-Apr-2009 |
zec |
Permit buiding kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_* family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted.
Reviewed by: bz, rwatson Approved by: julian (mentor)
|
#
191548 |
|
26-Apr-2009 |
zec |
In preparation for turning on options VIMAGE in next commits, rearrange / replace / adjust several INIT_VNET_* initializer macros, all of which currently resolve to whitespace.
Reviewed by: bz (an older version of the patch) Approved by: julian (mentor)
|
#
191287 |
|
19-Apr-2009 |
rwatson |
In divert_packet(), lock the interface address list before iterating over it in search of an address.
MFC after: 2 weeks
|
#
190951 |
|
11-Apr-2009 |
rwatson |
Update stats in struct ipstat using four new macros, IPSTAT_ADD(), IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures.
MFC after: 3 days
|
#
188066 |
|
03-Feb-2009 |
rrs |
Adds support for SCTP checksum offload. This means we, like TCP and UDP, move the checksum calculation into the IP routines when there is no hardware support we call into the normal SCTP checksum routine.
The next round of SCTP updates will use this functionality. Of course the IGB driver needs a few updates to support the new intel controller set that actually does SCTP csum offload too.
Reviewed by: gnn, rwatson, kmacy
|
#
185895 |
|
10-Dec-2008 |
zec |
Conditionally compile out V_ globals while instantiating the appropriate container structures, depending on VIMAGE_GLOBALS compile time option.
Make VIMAGE_GLOBALS a new compile-time option, which by default will not be defined, resulting in instatiations of global variables selected for V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be effectively compiled out. Instantiate new global container structures to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0, vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.
Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_ macros resolve either to the original globals, or to fields inside container structures, i.e. effectively
#ifdef VIMAGE_GLOBALS #define V_rt_tables rt_tables #else #define V_rt_tables vnet_net_0._rt_tables #endif
Update SYSCTL_V_*() macros to operate either on globals or on fields inside container structs.
Extend the internal kldsym() lookups with the ability to resolve selected fields inside the virtualization container structs. This applies only to the fields which are explicitly registered for kldsym() visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently this is done only in sys/net/if.c.
Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code, and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in turn result in proper code being generated depending on VIMAGE_GLOBALS.
De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c which were prematurely V_irtualized by automated V_ prepending scripts during earlier merging steps. PF virtualization will be done separately, most probably after next PF import.
Convert a few variable initializations at instantiation to initialization in init functions, most notably in ipfw. Also convert TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in initializer functions.
Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
|
#
185571 |
|
02-Dec-2008 |
bz |
Rather than using hidden includes (with cicular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files.
For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h.
Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation
|
#
185348 |
|
26-Nov-2008 |
zec |
Merge more of currently non-functional (i.e. resolving to whitespace) macros from p4/vimage branch.
Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.
De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless.
Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
|
#
185101 |
|
19-Nov-2008 |
julian |
Fix a scope problem in the multiple routing table code that stopped the SO_SETFIB socket option from working correctly.
Obtained from: Ironport MFC after: 3 days
|
#
185088 |
|
19-Nov-2008 |
zec |
Change the initialization methodology for global variables scheduled for virtualization.
Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks.
Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures.
Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
|
#
183982 |
|
17-Oct-2008 |
bz |
Add cr_canseeinpcb() doing checks using the cached socket credentials from inp_cred which is also available after the socket is gone. Switch cr_canseesocket consumers to cr_canseeinpcb. This removes an extra acquisition of the socket lock.
Reviewed by: rwatson MFC after: 3 months (set timer; decide then)
|
#
183550 |
|
02-Oct-2008 |
zec |
Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this point in time by doing MD5 comparision between pre- and post-change object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation
|
#
181803 |
|
17-Aug-2008 |
bz |
Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course of the next few weeks.
Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch
|
#
180851 |
|
27-Jul-2008 |
mav |
According to in_pcb.h protocol binding information has double locking. It allows access it while list travercing holding only global pcbinfo lock.
|
#
178376 |
|
21-Apr-2008 |
rwatson |
Read lock, rather than write lock, the inpcb when transmitting with or delivering to an IP divert socket.
MFC after: 3 months
|
#
178285 |
|
17-Apr-2008 |
rwatson |
Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to explicitly select write locking for all use of the inpcb mutex. Update some pcbinfo lock assertions to assert locked rather than write-locked, although in practice almost all uses of the pcbinfo rwlock main exclusive, and all instances of inpcb lock acquisition are exclusive.
This change should introduce (ideally) little functional change. However, it lays the groundwork for significantly increased parallelism in the TCP/IP code.
MFC after: 3 months Tested by: kris (superset of committered patch)
|
#
172930 |
|
24-Oct-2007 |
rwatson |
Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms:
mac_<object>_<method/action> mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names.
All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
|
#
172467 |
|
07-Oct-2007 |
silby |
Add FBSDID to all files in netinet so that people can more easily include file version information in bug reports.
Approved by: re (kensmith)
|
#
171746 |
|
06-Aug-2007 |
csjp |
Over the past couple of years, there have been a number of reports relating the use of divert sockets to dead locks. A number of LORs have been reported between divert and a number of other network subsystems including: IPSEC, Pfil, multicast, ipfw and others. Other dead locks could occur because of recursive entry into the IP stack. This change should take care of most if not all of these issues.
A summary of the changes follow:
- We disallow multicast operations on divert sockets. It really doesn't make semantic sense to allow this, since typically you would set multicast parameters on multicast end points.
NOTE: As a part of this change, we actually dis-allow multicast options on any socket that IS a divert socket OR IS NOT a SOCK_RAW or SOCK_DGRAM family
- We check to see if there are any socket options that have been specified on the socket, and if there was (which is very un-common and also probably doesnt make sense to support) we duplicate the mbuf carrying the options.
- We then drop the INP/INFO locks over the call to ip_output(). It should be noted that since we no longer support multicast operations on divert sockets and we have duplicated any socket options, we no longer need the reference to the pcb to be coherent.
- Finally, we replaced the call to ip_input() to use netisr queuing. This should remove the recursive entry into the IP stack from divert.
By dropping the locks over the call to ip_output() we eliminate all the lock ordering issues above. By switching over to netisr on the inbound path, we can no longer recursively enter the ip_input() code via divert.
I have tested this change by using the following command:
ipfwpcap -r 8000 - | tcpdump -r - -nn -v
This should exercise the input and re-injection (outbound) path, which is very similar to the work load performed by natd(8). Additionally, I have run some ospf daemons which have a heavy reliance on raw sockets and multicast.
Approved by: re@ (kensmith) MFC after: 1 month LOR: 163 LOR: 181 LOR: 202 LOR: 203 Discussed with: julian, andre et al (on freebsd-net) In collaboration with: bms [1], rwatson [2]
[1] bms helped out with the multicast decisions [2] rwatson submitted the original netisr patches and came up with some of the original ideas on how to combat this issue.
|
#
169462 |
|
11-May-2007 |
rwatson |
Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr protocol entry points using functions named proto_getsockaddr and proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr. While it's true that sockaddrs are allocated and set, the net effect is to retrieve (get) the socket address or peer address from a socket, not set it, so align names to that intent.
|
#
169461 |
|
11-May-2007 |
rwatson |
Remove unneeded wrappers for in_setsockaddr() and in_setpeeraddr(), which used to exist so pcbinfo locks could be acquired, but are no longer required as a result of socket/pcb reference model refinements.
|
#
169454 |
|
10-May-2007 |
rwatson |
Move universally to ANSI C function declarations, with relatively consistent style(9)-ish layout.
|
#
169179 |
|
01-May-2007 |
rwatson |
Remove unused pcbinfo arguments to in_setsockaddr() and in_setpeeraddr().
|
#
169154 |
|
30-Apr-2007 |
rwatson |
Rename some fields of struct inpcbinfo to have the ipi_ prefix, consistent with the naming of other structure field members, and reducing improper grep matches. Clean up and comment structure fields in structure definition.
|
#
165634 |
|
29-Dec-2006 |
jhb |
Some whitespace nits and remove a few casts.
|
#
164033 |
|
06-Nov-2006 |
rwatson |
Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking.
Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
|
#
163606 |
|
22-Oct-2006 |
rwatson |
Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project Sponsored by: SPARTA
|
#
160491 |
|
18-Jul-2006 |
ups |
Fix race conditions on enumerating pcb lists by moving the initialization ( and where appropriate the destruction) of the pcb mutex to the init/finit functions of the pcb zones. This allows locking of the pcb entries and race condition free comparison of the generation count. Rearrange locking a bit to avoid extra locking operation to update the generation count in in_pcballoc(). (in_pcballoc now returns the pcb locked)
I am planning to convert pcb list handling from a type safe to a reference count model soon. ( As this allows really freeing the PCBs)
Reviewed by: rwatson@, mohans@ MFC after: 1 week
|
#
160038 |
|
29-Jun-2006 |
yar |
There is a consensus that ifaddr.ifa_addr should never be NULL, except in places dealing with ifaddr creation or destruction; and in such special places incomplete ifaddrs should never be linked to system-wide data structures. Therefore we can eliminate all the superfluous checks for "ifa->ifa_addr != NULL" and get ready to the system crashing honestly instead of masking possible bugs.
Suggested by: glebius, jhb, ru
|
#
157927 |
|
21-Apr-2006 |
ps |
Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.
|
#
157423 |
|
03-Apr-2006 |
rwatson |
Correct incorrect assertion in div_bind(): inp must not be NULL here.
Reported by: tegge MFC after: 3 months
|
#
157374 |
|
01-Apr-2006 |
rwatson |
Update in_pcb-derived basic socket types following changes to pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, in protocol shutdown methods, and in raw IP send.
- Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this.
- Invoke in_pcbfree() after in_pcbdetach() in order to free the detached in_pcb structure for a socket.
MFC after: 3 months
|
#
157370 |
|
01-Apr-2006 |
rwatson |
Chance protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they other occur or the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic.
MFC after: 3 months
|
#
152242 |
|
09-Nov-2005 |
ru |
Use sparse initializers for "struct domain" and "struct protosw", so they are easier to follow for the human being.
|
#
146182 |
|
13-May-2005 |
glebius |
In div_output() explicitly set m->m_nextpkt to NULL. If divert socket is not userland, but ng_ksocket, then m->m_nextpkt may be non-NULL. In this case we would panic in sbappend.
|
#
145953 |
|
06-May-2005 |
cperciva |
If we are going to 1. Copy a NULL-terminated string into a fixed-length buffer, and 2. copyout that buffer to userland, we really ought to 0. Zero the entire buffer first.
Security: FreeBSD-SA-05:08.kmem
|
#
139823 |
|
07-Jan-2005 |
imp |
/* -> /*- for license, minor formatting changes
|
#
137860 |
|
18-Nov-2004 |
glebius |
- Since divert protocol is not connection oriented, remove SS_ISCONNECTED flag from divert sockets. - Remove div_disconnect() method, since it shouldn't be called now. - Remove div_abort() method. It was never called directly, since protocol doesn't have listen queue. It was called only from div_disconnect(), which is removed now.
Reviewed by: rwatson, maxim Approved by: julian (mentor) MT5 after: 1 week MT4 after: 1 month
|
#
137630 |
|
12-Nov-2004 |
glebius |
Fix ng_ksocket(4) operation as a divert socket, which is pretty useful and has been broken twice:
- in the beginning of div_output() replace KASSERT with assignment, as it was in rev. 1.83. [1] [to be MFCed] - refactor changes introduced in rev. 1.100: do not prepend a new tag unconditionally. Before doing this check whether we have one. [2]
A small note for all hacking in this area: when divert socket is not a real userland, but ng_ksocket(4), we receive _the same_ mbufs, that we transmitted to socket. These mbufs have rcvif, the tags we've put on them. And we should treat them correctly.
Discussed with: mlaier [1] Silence from: green [2] Reviewed by: maxim Approved by: julian (mentor) MFC after: 1 week
|
#
137584 |
|
11-Nov-2004 |
phk |
Add missing '='
Spotted by: obrien
|
#
137386 |
|
08-Nov-2004 |
phk |
Initialize struct pr_userreqs in new/sparse style and fill in common default elements in net_init_domain().
This makes it possible to grep these structures and see any bogosities.
|
#
136953 |
|
25-Oct-2004 |
andre |
IPDIVERT is a module now and tell the other parts of the kernel about it. IPDIVERT depends on IPFIREWALL being loaded or compiled into the kernel.
|
#
136788 |
|
22-Oct-2004 |
andre |
Refuse to unload the ipdivert module unless the 'force' flag is given to kldunload.
Reflect the fact that IPDIVERT is a loadable module in the divert(4) and ipfw(8) man pages.
|
#
136717 |
|
19-Oct-2004 |
andre |
Destroy the UMA zone on unload.
|
#
136716 |
|
19-Oct-2004 |
andre |
Slightly extend the locking during unload to fully cover the protocol deregistration. This does not entirely close the race but narrows the even previously extremely small chance of a race some more.
|
#
136715 |
|
19-Oct-2004 |
rwatson |
Annotate a newly introduced race present due to the unloading of protocols: it is possible for sockets to be created and attached to the divert protocol between the test for sockets present and successful unload of the registration handler. We will need to explore more mature APIs for unregistering the protocol and then draining consumers, or an atomic test-and-unregister mechanism.
|
#
136714 |
|
19-Oct-2004 |
andre |
Convert IPDIVERT into a loadable module. This makes use of the dynamic loadability of protocols. The call to divert_packet() is done through a function pointer. All semantics of IPDIVERT remain intact. If IPDIVERT is not loaded ipfw will refuse to install divert rules and natd will complain about 'protocol not supported'. Once it is loaded both will work and accept rules and open the divert socket. The module can only be unloaded if no divert sockets are open. It does not close any divert sockets when an unload is requested but will return EBUSY instead.
|
#
136073 |
|
03-Oct-2004 |
green |
Add support to IPFW for classification based on "diverted" status (that is, input via a divert socket).
|
#
134793 |
|
05-Sep-2004 |
jmg |
fix up socket/ip layer violation... don't assume/know that SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
|
#
133920 |
|
17-Aug-2004 |
andre |
Convert ipfw to use PFIL_HOOKS. This is change is transparent to userland and preserves the ipfw ABI. The ipfw core packet inspection and filtering functions have not been changed, only how ipfw is invoked is different.
However there are many changes how ipfw is and its add-on's are handled:
In general ipfw is now called through the PFIL_HOOKS and most associated magic, that was in ip_input() or ip_output() previously, is now done in ipfw_check_[in|out]() in the ipfw PFIL handler.
IPDIVERT is entirely handled within the ipfw PFIL handlers. A packet to be diverted is checked if it is fragmented, if yes, ip_reass() gets in for reassembly. If not, or all fragments arrived and the packet is complete, divert_packet is called directly. For 'tee' no reassembly attempt is made and a copy of the packet is sent to the divert socket unmodified. The original packet continues its way through ip_input/output().
ipfw 'forward' is done via m_tag's. The ipfw PFIL handlers tag the packet with the new destination sockaddr_in. A check if the new destination is a local IP address is made and the m_flags are set appropriately. ip_input() and ip_output() have some more work to do here. For ip_input() the m_flags are checked and a packet for us is directly sent to the 'ours' section for further processing. Destination changes on the input path are only tagged and the 'srcrt' flag to ip_forward() is set to disable destination checks and ICMP replies at this stage. The tag is going to be handled on output. ip_output() again checks for m_flags and the 'ours' tag. If found, the packet will be dropped back to the IP netisr where it is going to be picked up by ip_input() again and the directly sent to the 'ours' section. When only the destination changes, the route's 'dst' is overwritten with the new destination from the forward m_tag. Then it jumps back at the route lookup again and skips the firewall check because it has been marked with M_SKIP_FIREWALL. ipfw 'forward' has to be compiled into the kernel with 'option IPFIREWALL_FORWARD' to enable it.
DUMMYNET is entirely handled within the ipfw PFIL handlers. A packet for a dummynet pipe or queue is directly sent to dummynet_io(). Dummynet will then inject it back into ip_input/ip_output() after it has served its time. Dummynet packets are tagged and will continue from the next rule when they hit the ipfw PFIL handlers again after re-injection.
BRIDGING and IPFW_ETHER are not changed yet and use ipfw_chk() directly as they did before. Later this will be changed to dedicated ETHER PFIL_HOOKS.
More detailed changes to the code:
conf/files Add netinet/ip_fw_pfil.c.
conf/options Add IPFIREWALL_FORWARD option.
modules/ipfw/Makefile Add ip_fw_pfil.c.
net/bridge.c Disable PFIL_HOOKS if ipfw for bridging is active. Bridging ipfw is still directly invoked to handle layer2 headers and packets would get a double ipfw when run through PFIL_HOOKS as well.
netinet/ip_divert.c Removed divert_clone() function. It is no longer used.
netinet/ip_dummynet.[ch] Neither the route 'ro' nor the destination 'dst' need to be stored while in dummynet transit. Structure members and associated macros are removed.
netinet/ip_fastfwd.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code.
netinet/ip_fw.h Removed 'ro' and 'dst' from struct ip_fw_args.
netinet/ip_fw2.c (Re)moved some global variables and the module handling.
netinet/ip_fw_pfil.c New file containing the ipfw PFIL handlers and module initialization.
netinet/ip_input.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code. ip_forward() does not longer require the 'next_hop' struct sockaddr_in argument. Disable early checks if 'srcrt' is set.
netinet/ip_output.c Removed all direct ipfw handling code and replace it with the new 'ipfw forward' handling code.
netinet/ip_var.h Add ip_reass() as general function. (Used from ipfw PFIL handlers for IPDIVERT.)
netinet/raw_ip.c Directly check if ipfw and dummynet control pointers are active.
netinet/tcp_input.c Rework the 'ipfw forward' to local code to work with the new way of forward tags.
netinet/tcp_sack.c Remove include 'opt_ipfw.h' which is not needed here.
sys/mbuf.h Remove m_claim_next() macro which was exclusively for ipfw 'forward' and is no longer needed.
Approved by: re (scottl)
|
#
133517 |
|
11-Aug-2004 |
andre |
Backout removal of UMA_ZONE_NOFREE flag for all zones which are established for structures with timers in them. It might be that a timer might fire even when the associated structure has already been free'd. Having type- stable storage in this case is beneficial for graceful failure handling and debugging.
Discussed with: bosko, tegge, rwatson
|
#
133509 |
|
11-Aug-2004 |
andre |
Remove the UMA_ZONE_NOFREE flag to all uma_zcreate() calls in the IP and TCP code. This flag would have prevented giving back excessive free slabs to the global pool after a transient peak usage.
|
#
133069 |
|
03-Aug-2004 |
andre |
o Move all parts of the IP reassembly process into the function ip_reass() to make it fully self-contained. o ip_reass() now returns a new mbuf with the reassembled packet and ip->ip_len including the IP header. o Computation of the delayed checksum is moved into divert_packet().
Reviewed by: silby
|
#
131208 |
|
27-Jun-2004 |
phk |
Rwatson, write 100 times for tomorrow:
First unlock, then assign NULL to pointer.
|
#
131151 |
|
26-Jun-2004 |
rwatson |
Reduce the number of unnecessary unlock-relocks on socket buffer mutexes associated with performing a wakeup on the socket buffer:
- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly acquire the socket buffer lock and use the _locked() variants of both calls. Note that the _locked() sowakeup() versions unlock the mutex on return. This is done in uipc_send(), divert_packet(), mroute socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().
- When the socket buffer lock is dropped before a sowakeup(), remove the explicit unlock and use the _locked() sowakeup() variant. This is done in soisdisconnecting(), soisdisconnected() when setting the can't send/ receive flags and dropping data, and in uipc_rcvd() which adjusting back-pressure on the sockets.
For UNIX domain sockets running mpsafe with a contention-intensive SMP mysql benchmark, this results in a 1.6% query rate improvement due to reduce mutex costs.
|
#
130901 |
|
22-Jun-2004 |
rwatson |
Acquire socket lock around frobbing of socket state in divert sockets.
|
#
130900 |
|
22-Jun-2004 |
rwatson |
Prefer use of the inpcb as a MAC label source for outgoing packets sent via divert sockets, when available.
|
#
130398 |
|
13-Jun-2004 |
rwatson |
Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so):
- Hold socket lock over calls to MAC entry points reading or manipulating socket labels.
- Assert socket lock in MAC entry point implementations.
- When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
|
#
130337 |
|
11-Jun-2004 |
rwatson |
Remove unneeded Giant acquisition in divert_packet(), which is left over from debug.mpsafenet affecting only the forwarding plane. Giant is now acquired in the ithread/netisr or in the system call code.
|
#
128019 |
|
07-Apr-2004 |
imp |
Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson.
Approved by: core, peter, alc, rwatson
|
#
127505 |
|
27-Mar-2004 |
pjd |
Reduce 'td' argument to 'cred' (struct ucred) argument in those functions: - in_pcbbind(), - in_pcbbind_setup(), - in_pcbconnect(), - in_pcbconnect_setup(), - in6_pcbbind(), - in6_pcbconnect(), - in6_pcbsetport(). "It should simplify/clarify things a great deal." --rwatson
Requested by: rwatson Reviewed by: rwatson, ume
|
#
127504 |
|
27-Mar-2004 |
pjd |
Remove unused argument.
Reviewed by: ume
|
#
126253 |
|
26-Feb-2004 |
truckman |
Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no longer used.
Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the error return.
Modify sysctl_old_user to obey the wired buffer length and clean up its implementation.
Reviewed by: bms
|
#
126239 |
|
25-Feb-2004 |
mlaier |
Re-remove MT_TAGs. The problems with dummynet have been fixed now.
Tested by: -current, bms(mentor), me Approved by: bms(mentor), sam
|
#
125952 |
|
18-Feb-2004 |
mlaier |
Backout MT_TAG removal (i.e. bring back MT_TAGs) for now, as dummynet is not working properly with the patch in place.
Approved by: bms(mentor)
|
#
125784 |
|
13-Feb-2004 |
mlaier |
This set of changes eliminates the use of MT_TAG "pseudo mbufs", replacing them mostly with packet tags (one case is handled by using an mbuf flag since the linkage between "caller" and "callee" is direct and there's no need to incur the overhead of a packet tag).
This is (mostly) work from: sam
Silence from: -arch Approved by: bms(mentor), sam, rwatson
|
#
122991 |
|
26-Nov-2003 |
sam |
Split the "inp" mutex class into separate classes for each of divert, raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness complaints.
Reviewed by: rwatson Approved by: re (rwatson)
|
#
122922 |
|
20-Nov-2003 |
andre |
Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache.
It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve.
tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address.
It removes significant locking requirements from the tcp stack with regard to the routing table.
Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)
|
#
122875 |
|
18-Nov-2003 |
rwatson |
Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check.
For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy.
Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
#
122828 |
|
17-Nov-2003 |
green |
Fix a few cases where MT_TAG-type "fake mbufs" are created on the stack, but do not have mh_nextpkt initialized. Somtimes what's there is "1", and the ip_input() code pukes trying to m_free() it, rendering divert sockets and such broken. This really underscores the need to get rid of MT_TAG.
Reviewed by: rwatson
|
#
122331 |
|
08-Nov-2003 |
sam |
divert socket fixups:
o pickup Giant in divert_packet to protect sbappendaddr since it can be entered through MPSAFE callouts or through ip_input when mpsafenet is 1 o add missing locking on output o add locking to abort and shutdown o add a ctlinput handler to invalidate held routing table references on an ICMP redirect (may not be needed)
Supported by: FreeBSD Foundation
|
#
121816 |
|
31-Oct-2003 |
brooks |
Replace the if_name and if_unit members of struct ifnet with new members if_xname, if_dname, and if_dunit. if_xname is the name of the interface and if_dname/unit are the driver name and instance.
This change paves the way for interface renaming and enhanced pseudo device creation and configuration symantics.
Approved By: re (in principle) Reviewed By: njl, imp Tested On: i386, amd64, sparc64 Obtained From: NetBSD (if_xname)
|
#
119752 |
|
05-Sep-2003 |
sam |
o add locking o move the global divsrc socket address to a local variable instead of locking it
Sponsored by: FreeBSD Foundation
|
#
113255 |
|
08-Apr-2003 |
des |
Introduce an M_ASSERTPKTHDR() macro which performs the very common task of asserting that an mbuf has a packet header. Use it instead of hand- rolled versions wherever applicable.
Submitted by: Hiten Pandya <hiten@unixdaemons.com>
|
#
111119 |
|
19-Feb-2003 |
imp |
Back out M_* changes, per decision of the TRB.
Approved by: trb
|
#
110008 |
|
28-Jan-2003 |
phk |
Check bounds for index before dereferencing memory past end of array.
Found by: FlexeLint
|
#
109623 |
|
21-Jan-2003 |
alfred |
Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
|
#
106152 |
|
29-Oct-2002 |
fenner |
Renumber IPPROTO_DIVERT out of the range of valid IP protocol numbers. This allows socket() to return an error when the kernel is not built with IPDIVERT, and doesn't prevent future applications from using the "borrowed" IP protocol number. The sysctl net.inet.raw.olddiverterror controls whether opening a socket with the "borrowed" IP protocol fails with an accompanying kernel printf; this code should last only a couple of releases.
Approved by: re
|
#
105856 |
|
24-Oct-2002 |
mux |
Fix kernel build on sparc64 in the IPDIVERT case.
|
#
105194 |
|
16-Oct-2002 |
sam |
Replace aux mbufs with packet tags:
o instead of a list of mbufs use a list of m_tag structures a la openbsd o for netgraph et. al. extend the stock openbsd m_tag to include a 32-bit ABI/module number cookie o for openbsd compatibility define a well-known cookie MTAG_ABI_COMPAT and use this in defining openbsd-compatible m_tag_find and m_tag_get routines o rewrite KAME use of aux mbufs in terms of packet tags o eliminate the most heavily used aux mbufs by adding an additional struct inpcb parameter to ip_output and ip6_output to allow the IPsec code to locate the security policy to apply to outbound packets o bump __FreeBSD_version so code can be conditionalized o fixup ipfilter's call to ip_output based on __FreeBSD_version
Reviewed by: julian, luigi (silent), -arch, -net, darren Approved by: julian, silence from everyone else Obtained from: openbsd (mostly) MFC after: 1 month
|
#
101088 |
|
31-Jul-2002 |
rwatson |
Introduce support for Mandatory Access Control and extensible kernel access control.
Invoke the MAC framework to label mbuf created using divert sockets. These labels may later be used for access control on delivery to another socket, or to an interface.
Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI LAbs
|
#
98664 |
|
23-Jun-2002 |
luigi |
fix a typo in a comment
|
#
98613 |
|
22-Jun-2002 |
luigi |
Remove (almost all) global variables that were used to hold packet forwarding state ("annotations") during ip processing. The code is considerably cleaner now.
The variables removed by this change are:
ip_divert_cookie used by divert sockets ip_fw_fwd_addr used for transparent ip redirection last_pkt used by dynamic pipes in dummynet
Removal of the first two has been done by carrying the annotations into volatile structs prepended to the mbuf chains, and adding appropriate code to add/remove annotations in the routines which make use of them, i.e. ip_input(), ip_output(), tcp_input(), bdg_forward(), ether_demux(), ether_output_frame(), div_output().
On passing, remove a bug in divert handling of fragmented packet. Now it is the fragment at offset 0 which sets the divert status of the whole packet, whereas formerly it was the last incoming fragment to decide.
Removal of last_pkt required a change in the interface of ip_fw_chk() and dummynet_io(). On passing, use the same mechanism for dummynet annotations and for divert/forward annotations.
option IPFIREWALL_FORWARD is effectively useless, the code to implement it is very small and is now in by default to avoid the obfuscation of conditionally compiled code.
NOTES: * there is at least one global variable left, sro_fwd, in ip_output(). I am not sure if/how this can be removed.
* I have deliberately avoided gratuitous style changes in this commit to avoid cluttering the diffs. Minor stule cleanup will likely be necessary
* this commit only focused on the IP layer. I am sure there is a number of global variables used in the TCP and maybe UDP stack.
* despite the number of files touched, there are absolutely no API's or data structures changed by this commit (except the interfaces of ip_fw_chk() and dummynet_io(), which are internal anyways), so an MFC is quite safe and unintrusive (and desirable, given the improved readability of the code).
MFC after: 10 days
|
#
98115 |
|
11-Jun-2002 |
hsu |
Remember to initialize the control block head mutex.
|
#
98114 |
|
11-Jun-2002 |
hsu |
Fix typo.
Submitted by: Kyunghwan Kim <redjade@atropos.snu.ac.kr>
|
#
98102 |
|
10-Jun-2002 |
hsu |
Lock up inpcb.
Submitted by: Jennifer Yang <yangjihui@yahoo.com>
|
#
97658 |
|
31-May-2002 |
tanimura |
Back out my lats commit of locking down a socket, it conflicts with hsu's work.
Requested by: hsu
|
#
96972 |
|
20-May-2002 |
tanimura |
Lock down a socket, milestone 1.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket.
o Determine the lock strategy for each members in struct socket.
o Lock down the following members:
- so_count - so_options - so_linger - so_state
o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket:
- sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup()
Reviewed by: alfred
|
#
95759 |
|
30-Apr-2002 |
tanimura |
Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.
Requested by: bde
Since locking sigio_lock is usually followed by calling pgsigio(), move the declaration of sigio_lock and the definitions of SIGIO_*() to sys/signalvar.h.
While I am here, sort include files alphabetically, where possible.
|
#
94304 |
|
09-Apr-2002 |
jhb |
Change the first argument of prison_xinpcb() to be a thread pointer instead of a proc pointer so that prison_xinpcb() can use td_ucred.
|
#
93593 |
|
01-Apr-2002 |
jhb |
Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
|
#
92760 |
|
20-Mar-2002 |
jeff |
Switch vm_zone.h with uma.h. Change over to uma interfaces.
|
#
90868 |
|
18-Feb-2002 |
mike |
o Move NTOHL() and associated macros into <sys/param.h>. These are deprecated in favor of the POSIX-defined lowercase variants. o Change all occurrences of NTOHL() and associated marcros in the source tree to use the lowercase function variants. o Add missing license bits to sparc64's <machine/endian.h>. Approved by: jake o Clean up <machine/endian.h> files. o Remove unused __uint16_swap_uint32() from i386's <machine/endian.h>. o Remove prototypes for non-existent bswapXX() functions. o Include <machine/endian.h> in <arpa/inet.h> to define the POSIX-required ntohl() family of functions. o Do similar things to expose the ntohl() family in libstand, <netinet/in.h>, and <sys/param.h>. o Prepend underscores to the ntohl() family to help deal with complexities associated with having MD (asm and inline) versions, and having to prevent exposure of these functions in other headers that happen to make use of endian-specific defines. o Create weak aliases to the canonical function name to help deal with third-party software forgetting to include an appropriate header. o Remove some now unneeded pollution from <sys/types.h>. o Add missing <arpa/inet.h> includes in userland.
Tested on: alpha, i386 Reviewed by: bde, jake, tmm
|
#
87599 |
|
10-Dec-2001 |
obrien |
Update to C99, s/__FUNCTION__/__func__/, also don't use ANSI string concatenation.
|
#
86183 |
|
08-Nov-2001 |
rwatson |
o Replace reference to 'struct proc' with 'struct thread' in 'struct sysctl_req', which describes in-progress sysctl requests. This permits sysctl handlers to have access to the current thread, permitting work on implementing td->td_ucred, migration of suser() to using struct thread to derive the appropriate ucred, and allowing struct thread to be passed down to other code, such as network code where td is not currently available (and curproc is used).
o Note: netncp and netsmb are not updated to reflect this change, as they are not currently KSE-adapted.
Reviewed by: julian Obtained from: TrustedBSD Project
|
#
83366 |
|
12-Sep-2001 |
julian |
KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
|
#
82884 |
|
03-Sep-2001 |
julian |
Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again.
Obtained from: Well, KAME I guess.
|
#
80406 |
|
26-Jul-2001 |
ume |
move ipsec security policy allocation into in_pcballoc, before making pcbs available to the outside world. otherwise, we will see inpcb without ipsec security policy attached (-> panic() in ipsec.c).
Obtained from: KAME MFC after: 3 days
|
#
71999 |
|
04-Feb-2001 |
phk |
Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details.
Created with: sed(1) Reviewed by: md5(1)
|
#
67893 |
|
29-Oct-2000 |
phk |
Move suser() and suser_xxx() prototypes and a related #define from <sys/proc.h> to <sys/systm.h>.
Correctly document the #includes needed in the manpage.
Add one now needed #include of <sys/systm.h>. Remove the consequent 48 unused #includes of <sys/proc.h>.
|
#
65837 |
|
14-Sep-2000 |
ru |
Follow BSD/OS and NetBSD, keep the ip_id field in network order all the time.
Requested by: wollman
|
#
65327 |
|
01-Sep-2000 |
ru |
Fixed broken ICMP error generation, unified conversion of IP header fields between host and network byte order. The details:
o icmp_error() now does not add IP header length. This fixes the problem when icmp_error() is called from ip_forward(). In this case the ip_len of the original IP datagram returned with ICMP error was wrong.
o icmp_error() expects all three fields, ip_len, ip_id and ip_off in host byte order, so DTRT and convert these fields back to network byte order before sending a message. This fixes the problem described in PR 16240 and PR 20877 (ip_id field was returned in host byte order).
o ip_ttl decrement operation in ip_forward() was moved down to make sure that it does not corrupt the copy of original IP datagram passed later to icmp_error().
o A copy of original IP datagram in ip_forward() was made a read-write, independent copy. This fixes the problem I first reported to Garrett Wollman and Bill Fenner and later put in audit trail of PR 16240: ip_output() (not always) converts fields of original datagram to network byte order, but because copy (mcopy) and its original (m) most likely share the same mbuf cluster, ip_output()'s manipulations on original also corrupted the copy.
o ip_output() now expects all three fields, ip_len, ip_off and (what is significant) ip_id in host byte order. It was a headache for years that ip_id was handled differently. The only compatibility issue here is the raw IP socket interface with IP_HDRINCL socket option set and a non-zero ip_id field, but ip.4 manual page was unclear on whether in this case ip_id field should be in host or network byte order.
|
#
65260 |
|
30-Aug-2000 |
ru |
Fixed the bug that div_bind() always returned zero even if there was an error (broken in rev 1.9).
|
#
64192 |
|
03-Aug-2000 |
ru |
Make netstat(1) to be aware of divert(4) sockets.
|
#
59909 |
|
02-May-2000 |
paul |
Force the address of the socket to be INADDR_ANY immediately before calling in_pcbbind so that in_pcbbind sees a valid address if no address was specified (since divert sockets ignore them).
PR: 17552 Reviewed by: Brian
|
#
55601 |
|
08-Jan-2000 |
shin |
prevent kernel panic which happens when either of IPSEC and IPDIVERT is enabled.
Confirmed by: Eugene M. Kim <ab@astralblue.com>
|
#
55009 |
|
22-Dec-1999 |
shin |
IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers.
Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
|
#
54175 |
|
06-Dec-1999 |
archie |
Miscellaneous fixes/cleanups relating to ipfw and divert(4):
- Implement 'ipfw tee' (finally) - Divert packets by calling new function divert_packet() directly instead of going through protosw[]. - Replace kludgey global variable 'ip_divert_port' with a function parameter to divert_packet() - Replace kludgey global variable 'frag_divert_port' with a function parameter to ip_reass() - style(9) fixes
Reviewed by: julian, green
|
#
50477 |
|
28-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
#
46112 |
|
27-Apr-1999 |
phk |
Suser() simplification:
1: s/suser/suser_xxx/
2: Add new function: suser(struct proc *), prototyped in <sys/proc.h>.
3: s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/
The remaining suser_xxx() calls will be scrutinized and dealt with later.
There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce.
More changes to the suser() API will come along with the "jail" code.
|
#
43764 |
|
08-Feb-1999 |
julian |
remove leftover garbage line.
|
#
43763 |
|
08-Feb-1999 |
julian |
Fix for PR 9309. Divert was not feeding clean data to ifa_ifwithaddr() so it was giving bad results. Submitted by: kseel <kseel@utcorp.com>, Ruslan Ermilov <ru@ucb.crimea.ua>
|
#
41514 |
|
04-Dec-1998 |
archie |
Examine all occurrences of sprintf(), strcat(), and str[n]cpy() for possible buffer overflow problems. Replaced most sprintf()'s with snprintf(); for others cases, added terminating NUL bytes where appropriate, replaced constants like "16" with sizeof(), etc.
These changes include several bug fixes, but most changes are for maintainability's sake. Any instance where it wasn't "immediately obvious" that a buffer overflow could not occur was made safer.
Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Mike Spengler <mks@networkcs.com>
|
#
37433 |
|
06-Jul-1998 |
julian |
Bring back some slight cleanups from 2.2
|
#
37334 |
|
02-Jul-1998 |
julian |
Remove out of date comment.
|
#
37332 |
|
02-Jul-1998 |
julian |
Remove the option to keep IPFW diversion backwards compatible WRT diversion reinjection. No-one has been bitten by the new behaviour that I know of.
|
#
36906 |
|
12-Jun-1998 |
julian |
include opt_ipdivert.h so we get correct options
|
#
36903 |
|
12-Jun-1998 |
julian |
Allow diverted packets from the transmit side to remember if they had a recv interface and allow that state to be available after re-injection for further tests.
|
#
36708 |
|
06-Jun-1998 |
julian |
Fix wrong data type for a pointer.
|
#
36707 |
|
06-Jun-1998 |
julian |
clean up the changes made to ipfw over the last weeks (should make the ipfw lkm work again)
|
#
36678 |
|
05-Jun-1998 |
julian |
Reverse the default sense of the IPFW/DIVERT reinjection code so that the new behaviour is now default. Solves the "infinite loop in diversion" problem when more than one diversion is active. Man page changes follow.
The new code is in -stable as the NON default option.
|
#
36369 |
|
25-May-1998 |
julian |
Add optional code to change the way that divert and ipfw work together. Prior to this change, Accidental recursion protection was done by the diverted daemon feeding back the divert port number it got the packet on, as the port number on a sendto(). IPFW knew not to redivert a packet to this port (again). Processing of the ruleset started at the beginning again, skipping that divert port.
The new semantic (which is how we should have done it the first time) is that the port number in the sendto() is the rule number AFTER which processing should restart, and on a recvfrom(), the port number is the rule number which caused the diversion. This is much more flexible, and also more intuitive. If the user uses the same sockaddr received when resending, processing resumes at the rule number following that that caused the diversion. The user can however select to resume rule processing at any rule. (0 is restart at the beginning)
To enable the new code use
option IPFW_DIVERT_RESTART
This should become the default as soon as people have looked at it a bit
|
#
36364 |
|
25-May-1998 |
julian |
Hide the interface name in the sin_zero section of the sockaddr_in passed to the user process for incoming packets. When the sockaddr_in is passed back to the divert socket later, use thi sas the primary interface lookup and only revert to the IP address when the name fails. This solves a long standing bug with divert sockets: When two interfaces had the same address (P2P for example) the interface "assigned" to the reinjected packet was sometimes incorect. Probably we should define a "sockaddr_div" to officially hold this extended information in teh same manner as sockaddr_dl.
|
#
36363 |
|
25-May-1998 |
julian |
Take the user's "IGNORE_DIVERT" argument from where the user put it and not from the PCB which HAPPENS to contain the same number most of the time, but not always.
|
#
36079 |
|
15-May-1998 |
wollman |
Convert socket structures to be type-stable and add a version number.
Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for infomation about socket structures and use it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
|
#
34923 |
|
28-Mar-1998 |
bde |
Fixed style bugs (mostly) in previous commit.
|
#
34881 |
|
24-Mar-1998 |
wollman |
Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates its own zone; this is used particularly by TCP which allocates both inpcb and tcpcb in a single allocation. (Some hackery ensures that the tcpcb is reasonably aligned.) Also keep track of the number of pcbs of each type allocated, and keep a generation count (instance version number) for future use.
|
#
33134 |
|
06-Feb-1998 |
eivind |
Back out DIAGNOSTIC changes.
|
#
33108 |
|
04-Feb-1998 |
eivind |
Turn DIAGNOSTIC into a new-style option.
|
#
32821 |
|
27-Jan-1998 |
dg |
Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first.
Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initialial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however.
These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).
|
#
32350 |
|
08-Jan-1998 |
eivind |
Make INET a proper option.
This will not make any of object files that LINT create change; there might be differences with INET disabled, but hardly anything compiled before without INET anyway. Now the 'obvious' things will give a proper error if compiled without inet - ipx_ip, ipfw, tcp_debug. The only thing that _should_ work (but can't be made to compile reasonably easily) is sppp :-(
This commit move struct arpcom from <netinet/if_ether.h> to <net/if_arp.h>.
|
#
31838 |
|
18-Dec-1997 |
dg |
Call in_pcballoc() at splnet(). As near as I can tell, this won't fix any instability problems, but it was wrong nonetheless and will be required in an upcoming round of PCB changes.
|
#
29366 |
|
14-Sep-1997 |
peter |
Update network code to use poll support.
|
#
29327 |
|
13-Sep-1997 |
peter |
Some mbuf -> sockaddr changes seem to have been missed here.
|
#
27845 |
|
02-Aug-1997 |
bde |
Removed unused #includes.
|
#
26359 |
|
02-Jun-1997 |
julian |
Submitted by: Whistle Communications (archie Cobbs)
these are quite extensive additions to the ipfw code. they include a change to the API because the old method was broken, but the user view is kept the same.
The new code allows a particular match to skip forward to a particular line number, so that blocks of rules can be used without checking all the intervening rules. There are also many more ways of rejecting connections especially TCP related, and many many more ...
see the man page for a complete description.
|
#
26345 |
|
01-Jun-1997 |
peter |
typo fix, s/imp/inp'; move lookup call inside splnet since there were comments on it being outside.
|
#
26147 |
|
26-May-1997 |
peter |
Uninitialised inp variable in div_bind().
Submitted by: Åge Røbekk <aagero@aage.priv.no>
|
#
26096 |
|
24-May-1997 |
peter |
Attempt to convert the ip_divert code to use the new-style protocol request switch. I needed 'LINT' to compile for other reasons so I kinda got the blood on my hands. Note: I don't know how to test this, I don't know if it works correctly.
|
#
24570 |
|
03-Apr-1997 |
dg |
Reorganize elements of the inpcb struct to take better advantage of cache lines. Removed the struct ip proto since only a couple of chars were actually being used in it. Changed the order of compares in the PCB hash lookup to take advantage of partial cache line fills (on PPro).
Discussed-with: wollman
|
#
23324 |
|
03-Mar-1997 |
dg |
Improved performance of hash algorithm while (hopefully) not reducing the quality of the hash distribution. This does not fix a problem dealing with poor distribution when using lots of IP aliases and listening on the same port on every one of them...some other day perhaps; fixing that requires significant code changes. The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>
|
#
22975 |
|
22-Feb-1997 |
peter |
Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
|
#
22952 |
|
20-Feb-1997 |
wollman |
Fix the parameters of a call to in_setsockaddr().
|
#
22212 |
|
02-Feb-1997 |
brian |
Reset ip_divert_ignore to zero immediately after use - also, set it in the first place, independent of whether sin->sin_port is set.
The result is that diverted packets that are being forwarded will be diverted once and only once on the way in (ip_input()) and again, once and only once on the way out (ip_output()) - twice in total. ICMP packets that don't contain a port will now also be diverted.
|
#
21673 |
|
14-Jan-1997 |
jkh |
Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
#
20407 |
|
13-Dec-1996 |
wollman |
Convert the interface address and IP interface address structures to TAILQs. Fix places which referenced these for no good reason that I can see (the references remain, but were fixed to compile again; they are still questionable).
|
#
17072 |
|
10-Jul-1996 |
julian |
Adding changes to ipfw and the kernel to support ip packet diversion.. This stuff should not be too destructive if the IPDIVERT is not compiled in.. be aware that this changes the size of the ip_fw struct so ipfw needs to be recompiled to use it.. more changes coming to clean this up.
|