Cross Reference: /freebsd-9.3-release/sys/kern/uipc

History log of /freebsd-9.3-release/sys/kern/uipc_socket.c
Revision	Date	Author	Comments (<<< Hide modified files) (Show modified files >>>)
# 267654	19-Jun-2014	gjb	Copy stable/9 to releng/9.3 as part of the 9.3-RELEASE cycle. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation /freebsd-9.3-release
# 262579	27-Feb-2014	hiren	MFC r257472 Rate limit (to once per minute) "Listen queue overflow" message in sonewconn().
# 258099	13-Nov-2013	jhb	MFC 254699,255030: Use tvtohz() to convert a socket buffer timeout to a tick value rather than using a home-rolled version. The home-rolled version could result in shorter-than-requested sleeps. PR: kern/181416
# 254515	19-Aug-2013	andre	MFC a bundle of commits that bring autotuning to mbufs, maxfiles/sockets and maxusers to the 9-stable branch. It is committed as bundle because these patches build on each other and only provide the functionality in their entirety. Some are bug fixes to aspects of earlier commits. MFC r242029 (alfred): Allow autotune maxusers > 384 on 64 bit machines. MFC r242847 (alfred): Allow maxusers to scale on machines with large address space. MFC r243631 (andre): Base the mbuf related limits on the available physical memory or kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. MFC r243639 (andre): Complete r243631 by applying the remainder of kern_mbuf.c that got lost while merging into the commit tree. MFC r243668 (andre): Using a long is the wrong type to represent the realmem and maxmbufmem variable as they may overflow on i386/PAE and i386 with > 2GB RAM. MFC r243995, r243996, r243997 (pjd): Style cleanups, Make use of the fact that uma_zone_set_max(9) already returns actual limit set. MFC r244080 (andre): Prevent long type overflow of realmem calculation on ILP32 by forcing calculation to be in quad_t space. Fix style issue with second parameter to qmin(). MFC r245469 (alfred): Do not autotune ncallout to be greater than 18508. MFC r245575 (andre): Move the mbuf memory limit calculations from init_param2() to tunable_mbinit() where it is next to where it is used later. MFC r246207 (andre): Remove unused VM_MAX_AUTOTUNE_NMBCLUSTERS define. MFC r249843 (andre): Base the calculation of maxmbufmem in part on kmem_map size instead of kernel_map size to prevent kernel memory exhaustion by mbufs and a subsequent panic on physical page allocation failure. MFC r253204 (andre): Fix style issues, a typo in "kern.ipc.nmbufs" and correctly plave and expose the value of the tunable maxmbufmem as "kern.ipc.maxmbufmem" through sysctl. MFC r253207 (andre): Make use of the fact that uma_zone_set_max(9) already returns the rounded limit making a call to uma_zone_get_max(9) unnecessary. Tested by: alfred (iXsystems)
# 253035	08-Jul-2013	andre	MFC r241726: Move UMA socket zone initialization from uipc_domain.c to uipc_socket.c into one place next to its other related functions to avoid confusion. MFC r241729: Move socket UMA zone initialization functionality together into one place. MFC r241779: Tidy up somaxconn (accept queue limit) and related functions and move it together into one place.
# 252968	07-Jul-2013	tuexen	MFC r248172: Return an error if sctp_peeloff() fails because a socket can't be allocated. sctp_peeloff() uses sonewconn() also in cases where listen() wasn't called. So honor this use case.
# 252887	06-Jul-2013	jilles	MFC r250102: socket: Make shutdown() wake up a blocked accept(). A blocking accept (and some other operations) waits on &so->so_timeo. Once it wakes up, it will detect the SBS_CANTRCVMORE bit. The error from accept() is [ECONNABORTED] which is not the nicest one -- the thread calling accept() needs to know out-of-band what is happening. A spurious wakeup on so->so_timeo appears harmless (sleep retried) except when lingering on close (SO_LINGER, and in that case there is no descriptor to call shutdown() on) so this should be fairly safe. A shutdown() already woke up a blocked accept() for TCP sockets, but not for Unix domain sockets. This fix is generic for all domains. This patch was sent to -hackers@ and -net@ on April 5.
# 252843	05-Jul-2013	andre	MFC r241703: Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within zero copy specialized sosend_copyin() helper function. MFC r241704: Remove unnecessary includes from sosend_copyin() and fix a couple of style issues.
# 252785	05-Jul-2013	andre	MFC r242309: Fix a couple of soreceive_stream() issues. Submitted by: trociny
# 252783	05-Jul-2013	andre	MFC r243627, r243638: Fix a race on listen socket teardown where while draining the accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team.
# 252782	05-Jul-2013	andre	MFC r242306, r250365: Add logging for socket attach failures in sonewconn() during accept(2). Include the pointer to the PCB so it can be attributed to a particular application by corresponding it to "netstat -A" output.
# 241462	11-Oct-2012	np	MFC r233850: - Remove redundant call to pr_ctloutput from code that handles SO_SETFIB. - Add a check for errors during copyin while here.
# 240606	17-Sep-2012	trociny	MFC r240003, r240004: r240003: In soreceive_generic() when checking if the type of mbuf has changed check it for MT_CONTROL type too, otherwise the assertion "m->m_type == MT_DATA" below may be triggered by the following scenario: - the sender sends some data (MT_DATA) and then a file descriptor (MT_CONTROL); - the receiver calls recv(2) with a MSG_WAITALL asking for data larger than the receive buffer (uio_resid > hiwat). r240004: In soreceive_generic() remove the optimization for the case when MSG_WAITALL is set, and it is possible to do the entire receive operation at once if we block (resid <= hiwat). Actually it might make the recv(2) with MSG_WAITALL flag get stuck when there is enough space in the receiver buffer to satisfy the request but not enough to open the window closed previously due to the buffer being full. The issue can be reproduced using the following scenario: On the sender side do 2 send(2) requests: 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10); 2) data of size equal to SOBUF_SIZE. On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set: 1) recv() data of SOBUF_SIZE / 10 size; 2) recv() data of SOBUF_SIZE size; We totally fill the receiver buffer with one SOBUF_SIZE/10 size request and partial SOBUF_SIZE request. When the first request is processed we get SOBUF_SIZE/10 free space. It is just enough to receive the rest of bytes for the second request, and soreceive_generic() blocks in the part that is a subject of this change waiting for the rest. But the window was closed when the buffer was filled and to avoid silly window syndrome it opens only when available space is larger than sb_hiwat/4 or maxseg. So it is stuck and pending data is only sent via TCP window probes. Discussed with: kib (long ago)
# 239978	01-Sep-2012	trociny	MFC r238085: Fix KASSERT message.
# 233353	23-Mar-2012	kib	MFC r231949: Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. MFC r232493: Remove unneeded cast to u_int. The values as small enough to fit into int, beside the use of MIN macro which performs type promotions. MFC r232494: Instead of incomplete handling of read(2)/write(2) return values that does not fit into registers, declare that we do not support this case using CTASSERT(), and remove endianess-unsafe code to split return value into td_retval. While there, change the style of the sysctl debug.iosize_max_clamp definition. MFC r232495: pipe_read(): change the type of size to int, and remove signed clamp. pipe_write(): change the type of desiredsize back to int, its value fits.
# 232805	10-Mar-2012	kib	MFC r232179: Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the socket protocol number. PR: kern/162352
# 232804	10-Mar-2012	kib	MFC r232178: Remove apparently redundand checks for socket so_proto being non-NULL from sosetopt() and sogetopt().
# 232292	29-Feb-2012	bz	MFC r231852,232127: Merge multi-FIB IPv6 support. Extend the so far IPv4-only support for multiple routing tables (FIBs) introduced in r178888 to IPv6 providing feature parity. This includes an extended rtalloc(9) KPI for IPv6, the necessary adjustments to the network stack, and user land support as in netstat. Sponsored by: Cisco Systems, Inc.
# 225736	22-Sep-2011	kensmith	Copy head to stable/9 as part of 9.0-RELEASE release cycle. Approved by: re (implicit)
# 225177	25-Aug-2011	attilio	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before to destroy the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before to destroy the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which installs selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks
# 223863	08-Jul-2011	andre	In the experimental soreceive_stream(): o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF is correctly returned on a remote connection close. o In the non-blocking socket test compare SS_NBIO against the so->so_state field instead of the incorrect sb->sb_state field. o Simplify the ENOTCONN test by removing cases that can't occur. Submitted by: trociny (with some further tweaks by committer) Tested by: trociny
# 223839	07-Jul-2011	andre	Remove the TCP_SORECEIVE_STREAM compile time option. The use of soreceive_stream() for TCP still has to be enabled with the loader tuneable net.inet.tcp.soreceive_stream. Suggested by: trociny and others
# 222454	29-May-2011	trociny	In soreceive_generic(), if MSG_WAITALL is set but the request is larger than the receive buffer, we have to receive in sections. When notifying the protocol that some data has been drained the lock is released for a moment. Returning we block waiting for the rest of data. There is a race, when data could arrive while the lock was released and then the connection stalls in sbwait. Fix this by checking for data before blocking and skip blocking if there are some. PR: kern/154504 Reported by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Tested by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Reviewed by: rwatson Approved by: kib (co-mentor) MFC after: 2 weeks
# 218757	16-Feb-2011	bz	Mfp4 CH=177274,177280,177284-177285,177297,177324-177325 VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147. While this reduces the number of vnet recursion in some places like NFS, POSIX local sockets and some netgraph, .. recursions are impossible to fix. The current expectations are documented at the beginning of uipc_socket.c along with the other information there. Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks
# 218627	12-Feb-2011	deischen	Allow the SO_SETFIB socket option to select the default (0) routing table. Reviewed by: julian
# 218559	11-Feb-2011	bz	Mfp4 CH=177255: Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS. Change the syntax to match KASSERT() to allow more flexible panic messages rather than having a printf with hardcoded arguments before panic. Adjust the few assertions we have to the new format (and enhance the output). Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb MFC after: 2 weeks
# 215178	12-Nov-2010	luigi	This commit implements the SO_USER_COOKIE socket option, which lets you tag a socket with an uint32_t value. The cookie can then be used by the kernel for various purposes, e.g. setting the skipto rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has been implemented; however there is nothing ipfw-specific in its implementation). The ipfw-related code that uses the optopn will be committed separately. This change adds a field to 'struct socket', but the struct is not part of any driver or userland-visible ABI so the change should be harmless. See the discussion at http://lists.freebsd.org/pipermail/freebsd-ipfw/2009-October/004001.html Idea and code from Paul Joe, small modifications and manpage changes by myself. Submitted by: Paul Joe MFC after: 1 week
# 212822	18-Sep-2010	rwatson	With reworking of the socket life cycle in 7.x, the need for a "sotryfree()" was eliminated: all references to sockets are explicitly managed by sorele() and the protocols. As such, garbage collect sotryfree(), and update sofree() comments to make the new world order more clear. MFC after: 3 days Reported by: Anuranjan Shukla <anshukla at juniper dot net>
# 211030	07-Aug-2010	tuexen	Fix a bug where MSG_TRUNC was not returned in all necessary cases for SOCK_DGRAM socket. MSG_TRUNC was only returned when some mbufs could not be copied to the application. If some data was left in the last mbuf, it was correctly discarded, but MSG_TRUNC was not set. Reviewed by: bz MFC after: 3 weeks
# 208601	27-May-2010	rwatson	When close() is called on a connected socket pair, SO_ISCONNECTED might be set but be cleared before the call to sodisconnect(). In this case, ENOTCONN is returned: suppress this error rather than returning it to userspace so that close() doesn't report an error improperly. PR: kern/144061 Reported by: Matt Reimer <mreimer at vpop.net>, Nikolay Denev <ndenev at gmail.com>, Mikolaj Golub <to.my.trociny at gmail.com> MFC after: 3 days
# 205014	11-Mar-2010	nwhitehorn	Provide groundwork for 32-bit binary compatibility on non-x86 platforms, for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32 option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts of the kernel and enhances the freebsd32 compatibility code to support big-endian platforms. Reviewed by: kib, jhb
# 204147	20-Feb-2010	bz	Set curvnet earlier so that it also covers calls to sodisconnect(), which before were possibly panicing the system in ULP code in the VIMAGE case. Submitted by: Igor (igor ispsystem.com) MFC after: 5 days
# 197720	02-Oct-2009	rwatson	Don't comment on stream socket handling in sosend_dgram, since that's not handled. MFC after: 3 weeks
# 197236	15-Sep-2009	andre	-Put the optimized soreceive_stream() under a compile time option called TCP_SORECEIVE_STREAM for the time being. Requested by: brooks Once compiled in make it easily switchable for testers by using a tuneable net.inet.tcp.soreceive_stream and a corresponding read-only sysctl to report the current state. Suggested by: rwatson MFC after: 2 days -This line, and those below, will be ignored-- > Description of fields to fill in above: 76 columns --\| > PR: If a GNATS PR is affected by the change. > Submitted by: If someone else sent in the change. > Reviewed by: If someone else reviewed your modification. > Approved by: If you needed approval for this commit. > Obtained from: If the change is from a third party. > MFC after: N [day[s]\|week[s]\|month[s]]. Request a reminder email. > Security: Vulnerability reference (one per line) or description. > Empty fields above will be automatically removed. M sys/conf/options M sys/kern/uipc_socket.c M sys/netinet/tcp_subr.c M sys/netinet/tcp_usrreq.c
# 197134	12-Sep-2009	rwatson	Use C99 initialization for struct filterops. Obtained from: Mac OS X Sponsored by: Apple Inc. MFC after: 3 weeks
# 196556	25-Aug-2009	jilles	Fix poll() on half-closed sockets, while retaining POLLHUP for fifos. This reverts part of r196460, so that sockets only return POLLHUP if both directions are closed/error. Fifos get POLLHUP by closing the unused direction immediately after creating the sockets. The tools/regression/poll/*poll.c tests now pass except for two other things: - if POLLHUP is returned, POLLIN is always returned as well instead of only when there is data left in the buffer to be read - fifo old/new reader distinction does not work the way POSIX specs it Reviewed by: kib, bde
# 196460	23-Aug-2009	kib	Fix the conformance of poll(2) for sockets after r195423 by returning POLLHUP instead of POLLIN for several cases. Now, the tools/regression/poll results for FreeBSD are closer to that of the Solaris and Linux. Also, improve the POSIX conformance by explicitely clearing POLLOUT when POLLHUP is reported in pollscan(), making the fix global. Submitted by: bde Reviewed by: rwatson MFC after: 1 week
# 196019	01-Aug-2009	rwatson	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)
# 195922	28-Jul-2009	julian	Somewhere along the line accept sockets stopped honoring the FIB selected for them. Fix this. Reviewed by: ambrisko Approved by: re (kib) MFC after: 3 days
# 195769	19-Jul-2009	rwatson	Normalize field naming for struct vnet, fix two debugging printfs that print them. Reviewed by: bz Approved by: re (kensmith, kib)
# 195423	07-Jul-2009	kib	Fix poll(2) and select(2) for named pipes to return "ready for read" when all writers, observed by reader, exited. Use writer generation counter for fifo, and store the snapshot of the fifo generation in the f_seqcount field of struct file, that is otherwise unused for fifos. Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is equal to fifo' fi_wgen, and revert r89376. Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them. Note that the patch does not fix not returning POLLHUP for fifos. PR: kern/94772 Submitted by: bde (original version) Reviewed by: rwatson, jilles Approved by: re (kensmith) MFC after: 6 weeks (might be)
# 194672	22-Jun-2009	andre	Add soreceive_stream(), an optimized version of soreceive() for stream (TCP) sockets. It is functionally identical to generic soreceive() but has a number stream specific optimizations: o does only one sockbuf unlock/lock per receive independent of the length of data to be moved into the uio compared to soreceive() which unlocks/locks per mbuf. o uses m_mbuftouio() instead of its own copy(out) variant. o much more compact code flow as a large number of special cases is removed. o much improved reability. It offers significantly reduced CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application is using SO_RCVLOWAT to batch up some data before a read (and wakeup) is done. This function was written by "reverse engineering" and is not just a stripped down variant of soreceive(). It is not yet enabled by default on TCP sockets. Instead it is commented out in the protocol initialization in tcp_usrreq.c until more widespread testing has been done. Testers, especially with 10GigE gear, are welcome. MFP4: r164817 //depot/user/andre/soreceive_stream/
# 194252	15-Jun-2009	jamie	Get vnets from creds instead of threads where they're available, and from passed threads instead of curthread. Reviewed by: zec, julian Approved by: bz (mentor)
# 193951	10-Jun-2009	kib	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland
# 193511	05-Jun-2009	rwatson	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# 193332	02-Jun-2009	rwatson	Add internal 'mac_policy_count' counter to the MAC Framework, which is a count of the number of registered policies. Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero. This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies. Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls. Obtained from: TrustedBSD Project
# 193272	01-Jun-2009	jhb	Rework socket upcalls to close some races with setup/teardown of upcalls. - Each socket upcall is now invoked with the appropriate socket buffer locked. It is not permissible to call soisconnected() with this lock held; however, so socket upcalls now return an integer value. The two possible values are SU_OK and SU_ISCONNECTED. If an upcall returns SU_ISCONNECTED, then the soisconnected() will be invoked on the socket after the socket buffer lock is dropped. - A new API is provided for setting and clearing socket upcalls. The API consists of soupcall_set() and soupcall_clear(). - To simplify locking, each socket buffer now has a separate upcall. - When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from the receive socket buffer automatically. Note that a SO_SND upcall should never return SU_ISCONNECTED. - All this means that accept filters should now return SU_ISCONNECTED instead of calling soisconnected() directly. They also no longer need to explicitly clear the upcall on the new socket. - The HTTP accept filter still uses soupcall_set() to manage its internal state machine, but other accept filters no longer have any explicit knowlege of socket upcall internals aside from their return value. - The various RPC client upcalls currently drop the socket buffer lock while invoking soreceive() as a temporary band-aid. The plan for the future is to add a new flag to allow soreceive() to be called with the socket buffer locked. - The AIO callback for socket I/O is now also invoked with the socket buffer locked. Previously sowakeup() would drop the socket buffer lock only to call aio_swake() which immediately re-acquired the socket buffer lock for the duration of the function call. Discussed with: rwatson, rmacklem
# 191917	08-May-2009	zec	A NOP change: style / whitespace cleanup of the noise that slipped into r191816. Spotted by: bz Approved by: julian (mentor) (an earlier version of the diff)
# 191816	05-May-2009	zec	Change the curvnet variable from a global const struct vnet , previously always pointing to the default vnet context, to a dynamically changing thread-local one. The currvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discuouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_ macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, whith an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typicall cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)
# 191688	30-Apr-2009	zec	Permit buiding kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net ) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_ macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_ family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)
# 188146	05-Feb-2009	jamie	Don't allow creating a socket with a protocol family that the current jail doesn't support. This involves a new function prison_check_af, like prison_check_ip[46] but that checks only the family. With this change, most of the errors generated by jailed sockets shouldn't ever occur, at least until jails are changeable. Approved by: bz (mentor)
# 188123	04-Feb-2009	rwatson	Remove written-to but never read local variable 'offset' from soreceive_dgram(). Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de> MFC after: 1 week
# 185893	10-Dec-2008	bz	Make sure nmbclusters are initialized before maxsockets by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE before the init_maxsockets() SYSINT at SI_ORDER_ANY. Reviewed by: rwatson, zec Sponsored by: The FreeBSD Foundation MFC after: 4 weeks
# 185892	10-Dec-2008	bz	Style changes only. Put the return type on an extra line[1] and add an empty line at the beginning as we do not have any local variables. Submitted by: rwatson [1] Reviewed by: rwatson MFC after: 4 weeks
# 185435	29-Nov-2008	bz	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
# 185169	22-Nov-2008	kib	Add sv_flags field to struct sysentvec with intention to provide description of the ABI of the currently executing image. Change some places to test the flags instead of explicit comparing with address of known sysentvec structures to determine ABI features. Discussed with: dchagin, imp, jhb, peter
# 185101	19-Nov-2008	julian	Fix a scope problem in the multiple routing table code that stopped the SO_SETFIB socket option from working correctly. Obtained from: Ironport MFC after: 3 days
# 183963	16-Oct-2008	kmacy	make sure that SO_NO_DDP and SO_NO_OFFLOAD get passed in correctly PR: 127360 MFC after: 3 days
# 183675	07-Oct-2008	rwatson	In soreceive_dgram, when a 0-length buffer is passed into recv(2) and no data is ready, return 0 rather than blocking or returning EAGAIN. This is consistent with the behavior of soreceive_generic (soreceive) in earlier versions of FreeBSD, and restores this behavior for UDP. Discussed with: jhb, sam MFC after: 3 days
# 183664	07-Oct-2008	rwatson	Remove temporary debugging KASSERT's introduced to detect protocols improperly invoking sosend(), soreceive(), and sopoll() instead of attach either specialized or _generic() versions of those functions to their pru_sosend, pru_soreceive, and pru_sopoll protosw methods. MFC after: 3 days
# 183518	01-Oct-2008	jhb	Wait until after dropping the receive socket buffer lock to allocate space to store the socket address stored in the first mbuf in a packet chain. This reduces contention on the lock and CPU system time in certain UDP workloads. Tested by: ps Reviewed by: rwatson MFC after: 1 week
# 183512	01-Oct-2008	rwatson	Various cleanups for soreceive_dgram(): - Update or remove comments that were left over from the original soreceive_generic() implementation. Quite a few were misleading in the context of the new code. - Since soreceive_dgram() has a simpler structure, replace several gotos with a while loop making the invariants more clear. - In the blocking while loop, don't try to handle cases incompatible with the loop invariant (since m is always NULL, don't check for and handle non-NULL). - Don't drop and re-acquire the socket buffer lock unnecessarily after sbwait() returns, which may help reduce lock contention (etc). - Assume PR_ATOMIC since we assert it at the top of the function. MFC after: 3 days
# 183503	30-Sep-2008	jhb	Update the function name in several assertions in soreceive_dgram(). Approved by: rwatson MFC after: 3 days
# 182682	02-Sep-2008	rwatson	Remove XXXRW in soreceive_dgram that proves unnecessary. Remove unused orig_resid variable in soreceive_dgram. Submitted by: alfred X-MFC with: soreceive_dgram (r180198, r180211)
# 180641	20-Jul-2008	kmacy	Add accessor functions for socket fields. MFC after: 1 week
# 180211	03-Jul-2008	rwatson	Update copyright date in light of soreceive_dgram(9).
# 180198	02-Jul-2008	rwatson	Add soreceive_dgram(9), an optimized socket receive function for use by datagram-only protocols, such as UDP. This version removes use of sblock(), which is not required due to an inability to interlace data improperly with datagrams, as well as avoiding some of the larger loops and state management that don't apply on datagram sockets. This is experimental code, so hook it up only for UDPv4 for testing; if there are problems we may need to revise it or turn it off by default, but it offers significant performance improvements for threaded UDP applications such as BIND9, nsd, and memcached using UDP. Tested by: kris, ps
# 178888	09-May-2008	julian	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extent that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being reponded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect withthe proces that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routes accordingly. ipfw has grown 2 new keywords: setfib N ip from anay to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibilty of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the desination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco
# 178200	14-Apr-2008	rrs	Add pru_flush routine so a transport can flush itself during Shutdown MFC after: 1 week
# 177599	25-Mar-2008	ru	Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT. Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true since the advent of MBUMA. Reviewed by: arch There are ongoing disputes as to whether we want to switch to directly using UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
# 177380	19-Mar-2008	sobomax	Revert previous change - it appears that the limit I was hitting was a maxsockets limit, not maxfiles limit. The question remains why those limits are handled differently (with error code for maxfiles but with sleep for maxsokets), but those would be addressed in a separate commit if necessary. Requested by: rwhatson, jeff
# 177232	16-Mar-2008	sobomax	Properly set size of the file_zone to match kern.maxfiles parameter. Otherwise the parameter is no-op, since zone by default limits number of descriptors to some 12K entries. Attempt to allocate more ends up sleeping on zonelimit. MFC after: 2 weeks
# 175968	04-Feb-2008	rwatson	Further clean up sorflush: - Expose sbrelease_internal(), a variant of sbrelease() with no expectations about the validity of locks in the socket buffer. - Use sbrelease_internel() in sorflush(), and as a result avoid intializing and destroying a socket buffer lock for the temporary stack copy of the actual buffer, asb. - Add a comment indicating why we do what we do, and remove an XXX since things have gotten less ugly in sorflush() lately. This makes socket close cleaner, and possibly also marginally faster. MFC after: 3 weeks
# 175845	31-Jan-2008	rwatson	Correct two problems relating to sorflush(), which is called to flush read socket buffers in shutdown() and close(): - Call socantrcvmore() before sblock() to dislodge any threads that might be sleeping (potentially indefinitely) while holding sblock(), such as a thread blocked in recv(). - Flag the sblock() call as non-interruptible so that a signal delivered to the thread calling sorflush() doesn't cause sblock() to fail. The sblock() is required to ensure that all other socket consumer threads have, in fact, left, and do not enter, the socket buffer until we're done flushin it. To implement the latter, change the 'flags' argument to sblock() to accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK flag. When SBL_NOINTR is set, it forces a non-interruptible sx acquisition, regardless of the setting of the disposition of SB_NOINTR on the socket buffer; without this change it would be possible for another thread to clear SB_NOINTR between when the socket buffer mutex is released and sblock() is invoked. Reviewed by: bz, kmacy Reported by: Jos Backus <jos at catnook dot com>
# 172930	24-Oct-2007	rwatson	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 170289	04-Jun-2007	dwmalone	Despite several examples in the kernel, the third argument of sysctl_handle_int is not sizeof the int type you want to export. The type must always be an int or an unsigned int. Remove the instances where a sizeof(variable) is passed to stop people accidently cut and pasting these examples. In a few places this was sysctl_handle_int was being used on 64 bit types, which would truncate the value to be exported. In these cases use sysctl_handle_quad to export them and change the format to Q so that sysctl(1) can still print them.
# 170174	31-May-2007	jeff	- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. reads proceed without locks. - Aggregate exiting threads rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch() to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to it's exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits. Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
# 169624	16-May-2007	rwatson	Generally migrate to ANSI function headers, and remove 'register' use.
# 169375	08-May-2007	yongari	Add missing socket buffer unlock before returning to userland. Reviewed by: rwatson
# 169236	03-May-2007	rwatson	sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags on each socket buffer with the socket buffer's mutex. This sleep lock is used to serialize I/O on sockets in order to prevent I/O interlacing. This change replaces the custom sleep lock with an sx(9) lock, which results in marginally better performance, better handling of contention during simultaneous socket I/O across multiple threads, and a cleaner separation between the different layers of locking in socket buffers. Specifically, the socket buffer mutex is now solely responsible for serializing simultaneous operation on the socket buffer data structure, and not for I/O serialization. While here, fix two historic bugs: (1) a bug allowing I/O to be occasionally interlaced during long I/O operations (discovere by Isilon). (2) a bug in which failed non-blocking acquisition of the socket buffer I/O serialization lock might be ignored (discovered by sam). SCTP portion of this patch submitted by rrs.
# 167902	26-Mar-2007	rwatson	Following movement of functions from uipc_socket2.c to uipc_socket.c and uipc_sockbuf.c, clean up and update comments.
# 167895	26-Mar-2007	rwatson	Complete removal of uipc_socket2.c by moving the last few functions to other C files: - Move sbcreatecontrol() and sbtoxsockbuf() to uipc_sockbuf.c. While sbcreatecontrol() is really an mbuf allocation routine, it does its work with awareness of the layout of socket buffer memory. - Move pru_*() protocol switch stubs to uipc_socket.c where the non-stub versions of several of these functions live. Likewise, move socket state transition calls (soisconnecting(), etc) to uipc_socket.c. Moveo sodupsockaddr() and sotoxsocket().
# 167799	22-Mar-2007	glebius	Move the dom_dispose and pru_detach calls in sofree() earlier. Only after calling pru_detach we can be absolutely sure, that we don't have any references to the socket in the stack. This closes race between lockless sbdestroy() and data arriving on socket. Reviewed by: rwatson
# 167489	12-Mar-2007	jhb	- Use m_gethdr(), m_get(), and m_clget() instead of the macros in sosend_copyin(). - Use M_WAITOK instead of M_TRYWAIT in sosend_copyin(). - Don't check for NULL from M_WAITOK and return ENOBUFS. M_WAITOK/M_TRYWAIT allocations don't fail with NULL. Reviewed by: andre Requested by: andre (2)
# 167014	26-Feb-2007	ru	Don't block on the socket zone limit during the socket() call which can easily lock up a system otherwise; instead, return ENOBUFS as documented in a manpage, thus reverting us to the FreeBSD 4.x behavior. Reviewed by: rwatson MFC after: 2 weeks
# 166745	15-Feb-2007	rwatson	Rename somaxconn_sysctl() to sysctl_somaxconn() so that I will be able to claim that sofoo() functions all accept a socket as their first argument.
# 166447	03-Feb-2007	bms	Diff reduction with RELENG_6, style(9): Remove unnecessary brace; && should be on end of line. No functional changes.
# 166404	01-Feb-2007	andre	Generic socket buffer auto sizing support, header defines, flag inheritance. MFC after: 1 month
# 166171	22-Jan-2007	andre	Unbreak writes of 0 bytes. Zero byte writes happen when only ancillary control data but no payload data is passed. Change m_uiotombuf() to return at least one empty mbuf if the requested length was zero. Add comment to sosend_dgram and sosend_generic(). Diagnoses by: jhb Regression test by: rwatson Pointy hat to. andre
# 165889	08-Jan-2007	rwatson	Canonicalize copyrights in some files I hold copyrights on: - Sort by date in license blocks, oldest copyright first. - All rights reserved after all copyrights, not just the first. - Use (c) to be consistent with other entries. MFC after: 3 days
# 165503	23-Dec-2006	bms	Drop all received data mbufs from a socket's queue if the MT_SONAME mbuf is dropped, to preserve the invariant in the PR_ADDR case. Add a regression test to detect this condition, but do not hook it up to the build for now. PR: kern/38495 Submitted by: James Juran Reviewed by: sam, rwatson Obtained from: NetBSD MFC after: 2 weeks
# 164530	22-Nov-2006	mohans	Fix a race in soclose() where connections could be queued to the listening socket after the pass that cleans those queues. This results in these connections being orphaned (and leaked). The fix is to clean up the so queues after detaching the socket from the protocol. Thanks to ups and jhb for discussions and a thorough code review.
# 163916	02-Nov-2006	andre	Use the improved m_uiotombuf() function instead of home grown sosend_copyin() to do the userland to kernel copying in sosend_generic() and sosend_dgram(). sosend_copyin() is retained for ZERO_COPY_SOCKETS which are not yet supported by m_uiotombuf(). Benchmaring shows significant improvements (95% confidence): 66% less cpu (or 2.9 times better) with new sosend vs. old sosend (non-TSO) 65% less cpu (or 2.8 times better) with new sosend vs. old sosend (TSO) (Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back at 1000Base-TX full duplex.) Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 month
# 163606	22-Oct-2006	rwatson	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 162554	22-Sep-2006	bms	Fix a case where socket I/O atomicity is violated due to not dropping the entire record when a non-data mbuf is removed in the soreceive() path. This only triggers a panic directly when compiled with INVARIANTS. PR: 38495 Submitted by: James Juran MFC after: 1 week
# 162265	13-Sep-2006	pjd	Fix a lock leak in an error case. Reported by: netchild Reviewed by: rwatson
# 162204	10-Sep-2006	andre	New sockets created by incoming connections into listen sockets should inherit all settings and options except listen specific options. Add the missing send/receive timeouts and low watermarks. Remove inheritance of the field so_timeo which is unused. Noticed by: phk Reviewed by: rwatson Sponsored by: TCP/IP Optimization Fundraise 2005 MFC after: 3 days
# 161440	18-Aug-2006	gnn	Fix a kernel panic based on receiving an ICMPv6 Packet too Big message. PR: 99779 Submitted by: Jinmei Tatuya Reviewed by: clement, rwatson MFC after: 1 week
# 161230	11-Aug-2006	rwatson	Before performing a sodealloc() when pru_attach() fails, assert that the socket refcount remains 1, and then drop to 0 before freeing the socket. PR: 101763 Reported by: Gleb Kozyrev <gkozyrev at ukr dot net>
# 160933	02-Aug-2006	rwatson	Move destroying kqueue state from above pru_detach to below it in sofree(), as a number of protocols expect to be able to call soisdisconnected() during detach. That may not be a good assumption, but until I'm sure if it's a good assumption or not, allow it.
# 160896	01-Aug-2006	rwatson	Move updated of 'numopensockets' from bottom of sodealloc() to the top, eliminating a second set of identical mutex operations at the bottom. This allows brief exceeding of the max sockets limit, but only by sockets in the last stages of being torn down.
# 160875	01-Aug-2006	rwatson	Reimplement socket buffer tear-down in sofree(): as the socket is no longer referenced by other threads (hence our freeing it), we don't need to set the can't send and can't receive flags, wake up the consumers, perform two levels of locking, etc. Implement a fast-path teardown, sbdestroy(), which flushes and releases each socket buffer. A manual dom_dispose of the receive buffer is still required explicitly to GC any in-flight file descriptors, etc, before flushing the buffer. This results in a 9% UP performance improvement and 16% SMP performance improvement on a tight loop of socket();close(); in micro-benchmarking, but will likely also affect CPU-bound macro-benchmark performance.
# 160619	24-Jul-2006	rwatson	soreceive_generic(), and sopoll_generic(). Add new functions sosend(), soreceive(), and sopoll(), which are wrappers for pru_sosend, pru_soreceive, and pru_sopoll, and are now used univerally by socket consumers rather than either directly invoking the old so*() functions or directly invoking the protocol switch method (about an even split prior to this commit). This completes an architectural change that was begun in 1996 to permit protocols to provide substitute implementations, as now used by UDP. Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to perform these operations on sockets -- in particular, distributed file systems and socket system calls. Architectural head nod: sam, gnn, wollman
# 160601	23-Jul-2006	rwatson	Update various uipc_socket.c comments, and reformat others.
# 160549	21-Jul-2006	rwatson	Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach of the protocol retained a reference. This faciliates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true. Reviewed by: gnn
# 160415	16-Jul-2006	rwatson	Change comment on soabort() to more accurately describe how/when soabort() is used. Remove trailing white space.
# 160281	11-Jul-2006	rwatson	Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel) return void, so don't implement no-op versions of these functions. Instead, consistently check if those switch pointers are NULL before invoking them.
# 160280	11-Jul-2006	rwatson	When pru_attach() fails, call sodealloc() on the socket rather than using sorele() and the full tear-down path. Since protocol state allocation failed, this is not required (and is arguably undesirable). This matches the behavior of sonewconn() under the same circumstances.
# 159752	18-Jun-2006	rwatson	When retrieving SO_ERROR via getsockopt(), hold the socket lock around the retrieval and replacement with 0. MFC after: 1 week
# 159481	10-Jun-2006	rwatson	Move some functions and definitions from uipc_socket2.c to uipc_socket.c: - Move sonewconn(), which creates new sockets for incoming connections on listen sockets, so that all socket allocate code is together in uipc_socket.c. - Move 'maxsockets' and associated sysctls to uipc_socket.c with the socket allocation code. - Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it to sysctl.h and remove lots of scattered implementations in various IPC modules. - Sort sodealloc() after soalloc() in uipc_socket.c for dependency order reasons. Statisticize soalloc() and sodealloc() as they are now required only in uipc_socket.c, and are internal to the socket implementation. After this change, socket allocation and deallocation is entirely centralized in one file, and uipc_socket2.c consists entirely of socket buffer manipulation and default protocol switch functions. MFC after: 1 month
# 159417	08-Jun-2006	rwatson	Rearrange code in soalloc() so that it's less indented by returning early if uma_zalloc() from the socket zone fails. No functional change. MFC after: 1 week
# 157987	23-Apr-2006	rwatson	Assert that sockets passed into soabort() not be SQ_COMP or SQ_INCOMP, since that removal should have been done a layer up. MFC after: 3 months
# 157982	23-Apr-2006	rwatson	Add missing 'not' to SQ_COMP comment. MFC after: 3 months
# 157981	23-Apr-2006	rwatson	Move handling of SQ_COMP exception case in sofree() to the top of the function along with the remainder of the reference checking code. Move comment from body to header with remainder of comments. Inclusion of a socket in a completed connection queue counts as a true reference, and should not be handled as an under-documented edge case. MFC after: 3 months
# 157370	01-Apr-2006	rwatson	Chance protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they other occur or the protocol flags SS_PROTOREF to take ownership of the socket. soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals. Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it. In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach. netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic. MFC after: 3 months
# 157366	01-Apr-2006	rwatson	Change protocol switch pru_abort() API so that it returns void rather than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this. This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they have not are not fully updated by this commit. This will be corrected shortly in followup commits to these components. MFC after: 3 months
# 157359	01-Apr-2006	rwatson	Assert so->so_pcb is NULL in sodealloc() -- the protocol state should not be present at this point. We will eventually remove this assert because the socket layer should never look at so_pcb, but for now it's a useful debugging tool. MFC after: 3 months
# 157358	01-Apr-2006	rwatson	Add a somewhat sizable comment documenting the semantics of various kernel socket calls relating to the creation and destruction of sockets. This will eventually form the foundation of socket(9), but is currently in too much flux to do so. MFC after: 3 months
# 156763	16-Mar-2006	rwatson	Change soabort() from returning int to returning void, since all consumers ignore the return value, soabort() is required to succeed, and protocols produce errors here to report multiple freeing of the pcb, which we hope to eliminate.
# 156738	15-Mar-2006	rwatson	As with socket consumer references (so_count), make sofree() return without GC'ing the socket if a strong protocol reference to the socket is present (SS_PROTOREF).
# 155573	12-Feb-2006	rwatson	Improve consistency of return() style. MFC after: 3 days
# 154294	13-Jan-2006	rwatson	Add sosend_dgram(), a greatly reduced and simplified version of sosend() intended for use solely with atomic datagram socket types, and relies on the previous break-out of sosend_copyin(). Changes to allow UDP to optionally use this instead of sosend() will be committed as a follow-up.
# 152938	29-Nov-2005	jhb	Fix snderr() to not leak the socket buffer lock if an error occurs in sosend(). Robert accidentally changed the snderr() macro to jump to the out label which assumes the lock is already released rather than the release label which drops the lock in his previous change to sosend(). This should fix the recent panics about returning from write(2) with the socket lock held and the most recent LOR on current@.
# 152907	28-Nov-2005	rwatson	Move zero copy statistics structure before sosend_copyin(). MFC after: 1 month Reported by: tinderbox, sam
# 152894	28-Nov-2005	rwatson	Break out functionality in sosend() responsible for building mbuf chains and copying in mbufs from the body of the send logic, creating a new function sosend_copyin(). This changes makes sosend() almost readable, and will allow the same logic to be used by tailored socket send routines. MFC after: 1 month Reviewed by: andre, glebius
# 151967	02-Nov-2005	andre	Retire MT_HEADER mbuf type and change its users to use MT_DATA. Having an additional MT_HEADER mbuf type is superfluous and redundant as nothing depends on it. It only adds a layer of confusion. The distinction between header mbuf's and data mbuf's is solely done through the m->m_flags M_PKTHDR flag. Non-native code is not changed in this commit. For compatibility MT_HEADER is mapped to MT_DATA. Sponsored by: TCP/IP Optimization Fundraise 2005
# 151888	30-Oct-2005	rwatson	Push the assignment of a new or updated so_qlimit from solisten() following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol. Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.
# 151728	27-Oct-2005	ps	Allow 32bit get/setsockopt with SO_SNDTIMEO or SO_RECVTIMEO to work.
# 150302	18-Sep-2005	rwatson	Add three new read-only socket options, which allow regression tests and other applications to query the state of the stack regarding the accept queue on a listen socket: SO_LISTENQLIMIT Return the value of so_qlimit (socket backlog) SO_LISTENQLEN Return the value of so_qlen (complete sockets) SO_LISTENINCQLEN Return the value of so_incqlen (incomplete sockets) Minor white space tweaks to existing socket options to make them consistent. Discussed with: andre MFC after: 1 week
# 150282	18-Sep-2005	rwatson	Fix spelling in a comment. MFC after: 3 days
# 150155	15-Sep-2005	maxim	Backout rev. 1.246, it breaks code uses shutdown(2) on non-connected sockets. Pointed out by: rwatson
# 150152	15-Sep-2005	maxim	o Return ENOTCONN when shutdown(2) on non-connected socket. PR: kern/84761 Submitted by: James Juran R-test: tools/regression/sockets/shutdown MFC after: 1 month
# 149819	06-Sep-2005	glebius	In soreceive(), when a first mbuf is removed from socket buffer use sockbuf_pushsync(). Previous manipulation could lead to an inconsistent mbuf. Reviewed by: rwatson
# 148629	01-Aug-2005	kbyanc	Make getsockopt(..., SOL_SOCKET, SO_ACCEPTCONN, ...) work per IEEE Std 1003.1 (POSIX).
# 148474	28-Jul-2005	gnn	Fix for PR 83885. Make sure that there actually is a next packet before setting nextrecord to that field. PR: 83885 Submitted by: hirose@comm.yamaha.co.jp Obtained from: Patch suggested in the PR MFC after: 1 week
# 147730	01-Jul-2005	ssouhlal	Fix the recent panics/LORs/hangs created by my kqueue commit by: - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone
# 147256	10-Jun-2005	brooks	Stop embedding struct ifnet at the top of driver softcs. Instead the struct ifnet or the layer 2 common structure it was embedded in have been replaced with a struct ifnet pointer to be filled by a call to the new function, if_alloc(). The layer 2 common structure is also allocated via if_alloc() based on the interface type. It is hung off the new struct ifnet member, if_l2com. This change removes the size of these structures from the kernel ABI and will allow us to better manage them as interfaces come and go. Other changes of note: - Struct arpcom is no longer referenced in normal interface code. Instead the Ethernet address is accessed via the IFP2ENADDR() macro. To enforce this ac_enaddr has been renamed to _ac_enaddr. - The second argument to ether_ifattach is now always the mac address from driver private storage rather than sometimes being ac_enaddr. Reviewed by: sobomax, sam
# 147193	09-Jun-2005	scottl	Drat! Committed from the wrong branch. Restore HEAD to its previous goodness.
# 147192	09-Jun-2005	scottl	Back out 1.68.2.26. It was a mis-guided change that was already backed out of HEAD and should not have been MFC'd. This will restore UDP socket functionality, which will correct the recent NFS problems. Submitted by: rwatson
# 147009	05-Jun-2005	gallatin	Allow sends sent from non page-aligned userspace addresses to be considered for zero-copy sends. Reviewed by: alc Submitted by: Romer Gil at Rice University
# 143463	12-Mar-2005	rwatson	Move the logic implementing retrieval of the SO_ACCEPTFILTER socket option from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it now matches do_setopt_accept_filter(). Slightly reformulate the logic to match the optimistic allocation of storage for the argument in advance, and slightly expand the coverage of the socket lock.
# 143425	11-Mar-2005	rwatson	Remove an additional commented out reference to a possible future sx lock.
# 143422	11-Mar-2005	rwatson	When setting up a socket in socreate(), there's no need to lock the socket lock around knlist_init(), so don't. Hard code the setting of the socket reference count to 1 rather than using soref() to avoid asserting the socket lock, since we've not yet exposed the socket to other threads. This removes two mutex operations from each socket allocation.
# 143421	11-Mar-2005	rwatson	Remove suggestive sx_init() comment in soalloc(). We will have something like this at some point, but for now it clutters the source.
# 142190	21-Feb-2005	rwatson	In the current world order, solisten() implements the state transition of a socket from a regular socket to a listening socket able to accept new connections. As part of this state transition, solisten() calls into the protocol to update protocol-layer state. There were several bugs in this implementation that could result in a race wherein a TCP SYN received in the interval between the protocol state transition and the shortly following socket layer transition would result in a panic in the TCP code, as the socket would be in the TCPS_LISTEN state, but the socket would not have the SO_ACCEPTCONN flag set. This change does the following: - Pushes the socket state transition from the socket layer solisten() to to socket "library" routines called from the protocol. This permits the socket routines to be called while holding the protocol mutexes, preventing a race exposing the incomplete socket state transition to TCP after the TCP state transition has completed. The check for a socket layer state transition is performed by solisten_proto_check(), and the actual transition is performed by solisten_proto(). - Holds the socket lock for the duration of the socket state test and set, and over the protocol layer state transition, which is now possible as the socket lock is acquired by the protocol layer, rather than vice versa. This prevents additional state related races in the socket layer. This permits the dual transition of socket layer and protocol layer state to occur while holding locks for both layers, making the two changes atomic with respect to one another. Similar changes are likely require elsewhere in the socket/protocol code. Reported by: Peter Holm <peter@holm.cc> Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net> Philosophical head nod: gnn
# 142127	20-Feb-2005	rwatson	In soreceive(), when considering delivery to a socket in SS_ISCONFIRMING, only call the protocol's pru_rcvd() if the protocol has the flag PR_WANTRCVD set. This brings that instance of pru_rcvd() into line with the rest, which do check the flag. MFC after: 3 days
# 142062	18-Feb-2005	rwatson	Correct a typo in the comment describing soreceive_rcvoob(). MFC after: 3 days
# 142061	18-Feb-2005	rwatson	In soconnect(), when resetting so->so_error, the socket lock is not required due to a straight integer write in which minor races are not a problem.
# 142058	18-Feb-2005	rwatson	Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where the rest of the accept filter code currently lives. MFC after: 3 days
# 142055	18-Feb-2005	rwatson	Re-order checks in socheckuid() so that we check all deny cases before returning accept. MFC after: 3 days
# 142034	17-Feb-2005	rwatson	In solisten(), unconditionally set the SO_ACCEPTCONN option in so->so_options when solisten() will succeed, rather than setting it conditionally based on there not being queued sockets in the completed socket queue. Otherwise, if the protocol exposes new sockets via the completed queue before solisten() completes, the listen() system call will succeed, but the socket and protocol state will be out of sync. For TCP, this didn't happen in practice, as the TCP code will panic if a new connection comes in after the tcpcb has been transitioned to a listening state but the socket doesn't have SO_ACCEPTCONN set. This is historical behavior resulting from bitrot since 4.3BSD, in which that line of code was associated with the conditional NULL'ing of the connection queue pointers (one-time initialization to be performed during the transition to a listening socket), which are now initialized separately. Discussed with: fenner, gnn MFC after: 3 days
# 140730	24-Jan-2005	glebius	- Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from short to unsigned short. - Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX. Before this change setting somaxconn to smth above 32767 and calling listen(fd, -1) lead to a socket, which doesn't accept connections at all. Reviewed by: rwatson Reported by: Igor Sysoev
# 140112	12-Jan-2005	sobomax	When re-connecting already connected datagram socket ensure to clean up its pending error state, which may be set in some rare conditions resulting in connect() syscall returning that bogus error and making application believe that attempt to change association has failed, while it has not in fact. There is sockets/reconnect regression test which excersises this bug. MFC after: 2 weeks
# 139804	06-Jan-2005	imp	/* -> /*- for copyright notices, minor format tweaks as necessary
# 139216	22-Dec-2004	rwatson	Remove an XXXRW indicating atomic operations might be used as a substitute for a global mutex protecting the socket count and generation number. The observation that soreceive_rcvoob() can't return an mbuf chain is a property, not a bug, so remove the XXXRW. In sorflush, s/existing/previous/ for code when describing prior behavior. For SO_LINGER socket option retrieval, remove an XXXRW about why we hold the mutex: this is correct and not dubious. MFC after: 2 weeks
# 139215	22-Dec-2004	rwatson	In soalloc(), simplify the mac_init_socket() handling to remove unnecessary use of a global variable and simplify the return case. While here, use ()'s around return values. In sodealloc(), remove a comment about why we bump the gencnt and decrement the socket count separately. It doesn't add substantially to the reading, and clutters the function. MFC after: 2 weeks
# 138647	10-Dec-2004	alc	Remove unneeded code from the zero-copy receive path. Discussed with: gallatin@ Tested by: ken@
# 138539	08-Dec-2004	alc	Tidy up the zero-copy receive path: Remove an unneeded argument to uiomoveco() and userspaceco().
# 138206	29-Nov-2004	ps	If soreceive() is called from a socket callback, there's no reason to do a window update to the peer (thru an ACK) from soreceive() itself. TCP will do that upon return from the socket callback. Sending a window update from soreceive() results in a lock reversal. Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com Reviewed by: rwatson
# 138205	29-Nov-2004	ps	Make soreceive(MSG_DONTWAIT) nonblocking. If MSG_DONTWAIT is passed into soreceive(), then pass in M_DONTWAIT to m_copym(). Also fix up error handling for the case where m_copym() returns failure. Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com Reviewed by: rwatson
# 137473	09-Nov-2004	glebius	Since sb_timeo type was increased to int, use INT_MAX instead of SHRT_MAX. This also gives us ability to close PR. PR: kern/42352 Approved by: julian (mentor) MFC after: 1 week
# 137129	02-Nov-2004	rwatson	Acquire the accept mutex in soabort() before calling sotryfree(), as that is now required. RELENG_5_3 candidate. Foot provided by: Dikshie <dikshie at ppk dot itb dot ac dot id>
# 136822	23-Oct-2004	andre	socreate() does an early abort if either the protocol cannot be found, or pru_attach is NULL. With loadable protocols the SPACER dummy protocols have valid function pointers for all methods to functions returning just EOPNOTSUPP. Thus the early abort check would not detect immediately that attach is not supported for this protocol. Instead it would correctly get the EOPNOTSUPP error later on when it calls the protocol specific attach function. Add testing against the pru_attach_notsupp() function pointer to the early abort check as well.
# 136682	18-Oct-2004	rwatson	Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()): - This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd. - This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket. This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements. RELENG_5_3 candidate. MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>
# 136373	11-Oct-2004	rwatson	Rework sofree() logic to take into account a possible race with accept(). Sockets in the listen queues have reference counts of 0, so if the protocol decides to disconnect the pcb and try to free the socket, this triggered a race with accept() wherein accept() would bump the reference count before sofree() had removed the socket from the listen queues, resulting in a panic in sofree() when it discovered it was freeing a referenced socket. This might happen if a RST came in prior to accept() on a TCP connection. The fix is two-fold: to expand the coverage of the accept mutex earlier in sofree() to prevent accept() from grabbing the socket after the "is it really safe to free" tests, and to expand the logic of the "is it really safe to free" tests to check that the refcount is still 0 (i.e., we didn't race). RELENG_5 candidate. Much discussion with and work by: green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>
# 134815	05-Sep-2004	rwatson	Expand the scope of the socket buffer locks in sopoll() to include the state test as well as set, or we risk a race between a socket wakeup and registering for select() or poll() on the socket. This does increase the cost of the poll operation, but can probably be optimized some in the future. This appears to correct poll() "wedges" experienced with X11 on SMP systems with highly interactive applications, and might affect a plethora of other select() driven applications. RELENG_5 candidate. Problem reported by: Maxim Maximov <mcsi at mcsi dot pp dot ru> Debugged with help of: dwhite
# 134240	24-Aug-2004	rwatson	Conditional acquisition of socket buffer mutexes when testing socket buffers with kqueue filters is no longer required: the kqueue framework will guarantee that the mutex is held on entering the filter, either due to a call from the socket code already holding the mutex, or by explicitly acquiring it. This removes the last of the conditional socket locking.
# 134084	20-Aug-2004	rwatson	Back out uipc_socket.c:1.208, as it incorrectly assumes that all sockets are connection-oriented for the purposes of kqueue registration. Since UDP sockets aren't connection-oriented, this appeared to break a great many things, such as RPC-based applications and services (i.e., NFS). Since jmg isn't around I'm backing this out before too many more feet are shot, but intend to investigate the right solution with him once he's available. Apologies to: jmg Discussed with: imp, scottl
# 134062	20-Aug-2004	jmg	make sure that the socket is either accepting connections or is connected when attaching a knote to it... otherwise return EINVAL... Pointed out by: benno
# 133741	15-Aug-2004	jmg	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# 133467	11-Aug-2004	rwatson	Replace a reference to splnet() with a reference to locking in a comment.
# 132644	25-Jul-2004	rwatson	Do some initial locking on accept filter registration and attach. While here, close some races that existed in the pre-locking world during low memory conditions. This locking isn't perfect, but it's closer than before.
# 132359	18-Jul-2004	dwmalone	The recent changes to control message passing broke some things that get certain types of control messages (ping6 and rtsol are examples). This gets the new code closer to working: 1) Collect control mbufs for processing in the controlp == NULL case, so that they can be freed by externalize. 2) Loop over the list of control mbufs, as the externalize function may not know how to deal with chains. 3) In the case where there is no externalize function, remember to add the control mbuf to the controlp list so that it will be returned. 4) After adding stuff to the controlp list, walk to the end of the list of stuff that was added, incase we added a chain. This code can be further improved, but this is enough to get most things working again. Reviewed by: rwatson
# 132230	15-Jul-2004	rwatson	When entering soclose(), assert that SS_NOFDREF is not already set.
# 132060	12-Jul-2004	dwmalone	Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a a better name. I have a kern_[sg]etsockopt which I plan to commit shortly, but the arguments to these function will be quite different from so_setsockopt. Approved by: alfred
# 132018	12-Jul-2004	alfred	Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts. Tune the timeout from 5 seconds to 12 seconds. Provide a sysctl to show how many reconnects the NFS client has done. Seems to fix IPv6 from: kuriyama
# 131999	11-Jul-2004	rwatson	Use sockbuf_pushsync() to synchronize stack and socket buffer state in soreceive() after removing an MT_SONAME mbuf from the head of the socket buffer. When processing MT_CONTROL mbufs in soreceive(), first remove all of the MT_CONTROL mbufs from the head of the socket buffer to a local mbuf chain, then feed them into dom_externalize() as a set, which both avoids thrashing the socket buffer lock when handling multiple control mbufs, and also avoids races with other threads acting on the socket buffer when the socket buffer mutex is released to enter the externalize code. Existing races that might occur if the protocol externalize method blocked during processing have also been closed. Now that we synchronize socket buffer and stack state following modifications to the socket buffer, turn the manual synchronization that previously followed control mbuf processing with a set of assertions. This can eventually be removed. The soreceive() code is now substantially more MPSAFE.
# 131997	11-Jul-2004	rwatson	Add sockbuf_pushsync(), an inline function that, following a change to the head of the mbuf chains in a socket buffer, re-synchronizes the cache pointers used to optimize socket buffer appends. This will be used by soreceive() before dropping socket buffer mutexes to make sure a consistent version of the socket buffer is visible to other threads. While here, update copyright to account for substantial rewrite of much socket code required for fine-grained locking.
# 131993	11-Jul-2004	rwatson	Add additional annotations to soreceive(), documenting the effects of locking on 'nextrecord' and concerns regarding potentially inconsistent or stale use of socket buffer or stack fields if they aren't carefully synchronized whenever the socket buffer mutex is released. Document that the high-level sblock() prevents races against other readers on the socket. Also document the 'type' logic as to how soreceive() guarantees that it will only return one of normal data or inline out-of-band data.
# 131959	10-Jul-2004	rwatson	In the 'dontblock' section of soreceive(), assert that the mbuf on hand ('m') is in fact the first mbuf in the receive socket buffer.
# 131956	10-Jul-2004	rwatson	Break out non-inline out-of-band data receive code from soreceive() and put it in its own helper function soreceive_rcvoob().
# 131955	10-Jul-2004	rwatson	Assign pointers values of NULL rather than 0 in soreceive().
# 131932	10-Jul-2004	rwatson	When the MT_SONAME mbuf is popped off of a receive socket buffer associated with a PR_ADDR protocol, make sure to update the m_nextpkt pointer of the new head mbuf on the chain to point to the next record. Otherwise, when we release the socket buffer mutex, the socket buffer mbuf chain may be in an inconsistent state.
# 131890	10-Jul-2004	rwatson	Now socket buffer locks are being asserted at higher code blocks in soreceive(), remove some leaf assertions that are redundant.
# 131889	10-Jul-2004	rwatson	Assert socket buffer lock at strategic points between sections of code in soreceive() to confirm we've moved from block to block properly maintaining locking invariants.
# 131644	05-Jul-2004	rwatson	Drop the socket buffer lock around a call to m_copym() with M_TRYWAIT. A subset of locking changes to soreceive() in the queue for merging. Bumped into by: Willem Jan Withagen <wjw@withagen.nl>
# 131167	27-Jun-2004	rwatson	Add a new global mutex, so_global_mtx, which protects the global variables so_gencnt, numopensockets, and the per-socket field so_gencnt. Annotate this this might be better done with atomic operations. Annotate what accept_mtx protects.
# 131145	26-Jun-2004	rwatson	Replace comment on spl state when calling soabort() with a comment on locking state. No socket locks should be held when calling soabort() as it will call into protocol code that may acquire socket locks.
# 131030	24-Jun-2004	rwatson	Lock socket buffers when processing setting socket options SO_SNDLOWAT or SO_RCVLOWAT for read-modify-write.
# 131005	23-Jun-2004	rwatson	Slide socket buffer lock earlier in sopoll() to cover the call into selrecord(), setting up select and flagging the socker buffers as SB_SEL and setting up select under the lock.
# 130899	22-Jun-2004	rwatson	Remove spl's from uipc_socket to ease in merging.
# 130831	20-Jun-2004	rwatson	Merge next step in socket buffer locking: - sowakeup() now asserts the socket buffer lock on entry. Move the call to KNOTE higher in sowakeup() so that it is made with the socket buffer lock held for consistency with other calls. Release the socket buffer lock prior to calling into pgsigio(), so_upcall(), or aio_swake(). Locking for this event management will need revisiting in the future, but this model avoids lock order reversals when upcalls into other subsystems result in socket/socket buffer operations. Assert that the socket buffer lock is not held at the end of the function. - Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now have _locked versions which assert the socket buffer lock on entry. If a wakeup is required by sb_notify(), invoke sowakeup(); otherwise, unconditionally release the socket buffer lock. This results in the socket buffer lock being released whether a wakeup is required or not. - Break out socantsendmore() into socantsendmore_locked() that asserts the socket buffer lock. socantsendmore() unconditionally locks the socket buffer before calling socantsendmore_locked(). Note that both functions return with the socket buffer unlocked as socantsendmore_locked() calls sowwakeup_locked() which has the same properties. Assert that the socket buffer is unlocked on return. - Break out socantrcvmore() into socantrcvmore_locked() that asserts the socket buffer lock. socantrcvmore() unconditionally locks the socket buffer before calling socantrcvmore_locked(). Note that both functions return with the socket buffer unlocked as socantrcvmore_locked() calls sorwakeup_locked() which has similar properties. Assert that the socket buffer is unlocked on return. - Break out sbrelease() into a sbrelease_locked() that asserts the socket buffer lock. sbrelease() unconditionally locks the socket buffer before calling sbrelease_locked(). sbrelease_locked() now invokes sbflush_locked() instead of sbflush(). - Assert the socket buffer lock in socket buffer sanity check functions sblastrecordchk(), sblastmbufchk(). - Assert the socket buffer lock in SBLINKRECORD(). - Break out various sbappend() functions into sbappend_locked() (and variations on that name) that assert the socket buffer lock. The !_locked() variations unconditionally lock the socket buffer before calling their _locked counterparts. Internally, make sure to call _locked() support routines, etc, if already holding the socket buffer lock. - Break out sbinsertoob() into sbinsertoob_locked() that asserts the socket buffer lock. sbinsertoob() unconditionally locks the socket buffer before calling sbinsertoob_locked(). - Break out sbflush() into sbflush_locked() that asserts the socket buffer lock. sbflush() unconditionally locks the socket buffer before calling sbflush_locked(). Update panic strings for new function names. - Break out sbdrop() into sbdrop_locked() that asserts the socket buffer lock. sbdrop() unconditionally locks the socket buffer before calling sbdrop_locked(). - Break out sbdroprecord() into sbdroprecord_locked() that asserts the socket buffer lock. sbdroprecord() unconditionally locks the socket buffer before calling sbdroprecord_locked(). - sofree() now calls socantsendmore_locked() and re-acquires the socket buffer lock on return. It also now calls sbrelease_locked(). - sorflush() now calls socantrcvmore_locked() and re-acquires the socket buffer lock on return. Clean up/mess up other behavior in sorflush() relating to the temporary stack copy of the socket buffer used with dom_dispose by more properly initializing the temporary copy, and selectively bzeroing/copying more carefully to prevent WITNESS from getting confused by improperly initialized mutexes. Annotate why that's necessary, or at least, needed. - soisconnected() now calls sbdrop_locked() before unlocking the socket buffer to avoid locking overhead. Some parts of this change were: Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
# 130801	20-Jun-2004	rwatson	When retrieving the SO_LINGER socket option for user space, hold the socket lock over pulling so_options and so_linger out of the socket structure in order to retrieve a consistent snapshot. This may be overkill if user space doesn't require a consistent snapshot.
# 130800	20-Jun-2004	rwatson	Convert an if->panic in soclose() into a call to KASSERT().
# 130797	20-Jun-2004	rwatson	Annotate some ordering-related issues in solisten() which are not yet resolved by socket locking: in particular, that we test the connection state at the socket layer without locking, request that the protocol begin listening, and then set the listen state on the socket non-atomically, resulting in a non-atomic cross-layer test-and-set.
# 130705	19-Jun-2004	rwatson	Assert socket buffer lock in sb_lock() to protect socket buffer sleep lock state. Convert tsleep() into msleep() with socket buffer mutex as argument. Hold socket buffer lock over sbunlock() to protect sleep lock state. Assert socket buffer lock in sbwait() to protect the socket buffer wait state. Convert tsleep() into msleep() with socket buffer mutex as argument. Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK() in order to call into these functions with the lock, as well as to start protecting other socket buffer use in their implementation. Drop the socket buffer mutexes around calls into the protocol layer, around potentially blocking operations, for copying to/from user space, and VM operations relating to zero-copy. Assert the socket buffer mutex strategically after code sections or at the beginning of loops. In some cases, modify return code to ensure locks are properly dropped. Convert the potentially blocking allocation of storage for the remote address in soreceive() into a non-blocking allocation; we may wish to move the allocation earlier so that it can block prior to acquisition of the socket buffer lock. Drop some spl use. NOTE: Some races exist in the current structuring of sosend() and soreceive(). This commit only merges basic socket locking in this code; follow-up commits will close additional races. As merged, these changes are not sufficient to run without Giant safely. Reviewed by: juli, tjr
# 130668	18-Jun-2004	rwatson	Hold SOCK_LOCK(so) while frobbing so_options. Note that while the local race is corrected, there's still a global race in sosend() relating to so_options and the SO_DONTROUTE flag.
# 130665	18-Jun-2004	rwatson	Merge some additional leaf node socket buffer locking from rwatson_netperf: Introduce conditional locking of the socket buffer in fifofs kqueue filters; KNOTE() will be called holding the socket buffer locks in fifofs, but sometimes the kqueue() system call will poll using the same entry point without holding the socket buffer lock. Introduce conditional locking of the socket buffer in the socket kqueue filters; KNOTE() will be called holding the socket buffer locks in the socket code, but sometimes the kqueue() system call will poll using the same entry points without holding the socket buffer lock. Simplify the logic in sodisconnect() since we no longer need spls. NOTE: To remove conditional locking in the kqueue filters, it would make sense to use a separate kqueue API entry into the socket/fifo code when calling from the kqueue() system call.
# 130653	17-Jun-2004	rwatson	Merge additional socket buffer locking from rwatson_netperf: - Lock down low hanging fruit use of sb_flags with socket buffer lock. - Lock down low hanging fruit use of so_state with socket lock. - Lock down low hanging fruit use of so_options. - Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with socket buffer lock. - Annotate situations in which we unlock the socket lock and then grab the receive socket buffer lock, which are currently actually the same lock. Depending on how we want to play our cards, we may want to coallesce these lock uses to reduce overhead. - Convert a if()->panic() into a KASSERT relating to so_state in soaccept(). - Remove a number of splnet()/splx() references. More complex merging of socket and socket buffer locking to follow.
# 130480	14-Jun-2004	rwatson	The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state: SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state) Rename respectively to: SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state) This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coallesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
# 130387	12-Jun-2004	rwatson	Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count: - Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele(). - Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree(). - Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers. - In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket. - Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket. Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
# 130380	12-Jun-2004	rwatson	Introduce a mutex into struct sockbuf, sb_mtx, which will be used to protect fields in the socket buffer. Add accessor macros to use the mutex (SOCKBUF_()). Initialize the mutex in soalloc(), and destroy it in sodealloc(). Add addition, add SOCK_() access macros which will protect most remaining fields in the socket; for the time being, use the receive socket buffer mutex to implement socket level locking to reduce memory overhead. Submitted by: sam Sponosored by: FreeBSD Foundation Obtained from: BSD/OS
# 130246	08-Jun-2004	stefanf	Avoid assignments to cast expressions. Reviewed by: md5 Approved by: das (mentor)
# 129979	02-Jun-2004	rwatson	Integrate accept locking from rwatson_netperf, introducing a new global mutex, accept_mtx, which serializes access to the following fields across all sockets: so_qlen so_incqlen so_qstate so_comp so_incomp so_list so_head While providing only coarse granularity, this approach avoids lock order issues between sockets by avoiding ownership of the fields by a specific socket and its per-socket mutexes. While here, rewrite soclose(), sofree(), soaccept(), and sonewconn() to add assertions, close additional races and address lock order concerns. In particular: - Reorganize the optimistic concurrency behavior in accept1() to always allocate a file descriptor with falloc() so that if we do find a socket, we don't have to encounter the "Oh, there wasn't a socket" race that can occur if falloc() sleeps in the current code, which broke inbound accept() ordering, not to mention requiring backing out socket state changes in a way that raced with the protocol level. We may want to add a lockless read of the queue state if polling of empty queues proves to be important to optimize. - In accept1(), soref() the socket while holding the accept lock so that the socket cannot be free'd in a race with the protocol layer. Likewise in netgraph equivilents of the accept1() code. - In sonewconn(), loop waiting for the queue to be small enough to insert our new socket once we've committed to inserting it, or races can occur that cause the incomplete socket queue to overfill. In the previously implementation, it was sufficient to simply tested once since calling soabort() didn't release synchronization permitting another thread to insert a socket as we discard a previous one. - In soclose()/sofree()/et al, it is the responsibility of the caller to remove a socket from the incomplete connection queue before calling soabort(), which prevents soabort() from having to walk into the accept socket to release the socket from its queue, and avoids races when releasing the accept mutex to enter soabort(), permitting soabort() to avoid lock ordering issues with the caller. - Generally cluster accept queue related operations together throughout these functions in order to facilitate locking. Annotate new locking in socketvar.h.
# 129916	01-Jun-2004	rwatson	The SS_COMP and SS_INCOMP flags in the so_state field indicate whether the socket is on an accept queue of a listen socket. This change renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new state field on the socket, so_qstate, as the locking for these flags is substantially different for the locking on the remainder of the flags in so_state.
# 129911	31-May-2004	truckman	Add MSG_NBIO flag option to soreceive() and sosend() that causes them to behave the same as if the SS_NBIO socket flag had been set for this call. The SS_NBIO flag for ordinary sockets is set by fcntl(fd, F_SETFL, O_NONBLOCK). Pass the MSG_NBIO flag to the soreceive() and sosend() calls in fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag on the underlying socket for each I/O operation. The O_NONBLOCK flag is a property of the descriptor, and unlike ordinary sockets, fifos may be referenced by multiple descriptors.
# 129906	31-May-2004	bmilekic	Bring in mbuma to replace mballoc. mbuma is an Mbuf & Cluster allocator built on top of a number of extensions to the UMA framework, all included herein. Extensions to UMA worth noting: - Better layering between slab <-> zone caches; introduce Keg structure which splits off slab cache away from the zone structure and allows multiple zones to be stacked on top of a single Keg (single type of slab cache); perhaps we should look into defining a subset API on top of the Keg for special use by malloc(9), for example. - UMA_ZONE_REFCNT zones can now be added, and reference counters automagically allocated for them within the end of the associated slab structures. uma_find_refcnt() does a kextract to fetch the slab struct reference from the underlying page, and lookup the corresponding refcnt. mbuma things worth noting: - integrates mbuf & cluster allocations with extended UMA and provides caches for commonly-allocated items; defines several zones (two primary, one secondary) and two kegs. - change up certain code paths that always used to do: m_get() + m_clget() to instead just use m_getcl() and try to take advantage of the newly defined secondary Packet zone. - netstat(1) and systat(1) quickly hacked up to do basic stat reporting but additional stats work needs to be done once some other details within UMA have been taken care of and it becomes clearer to how stats will work within the modified framework. From the user perspective, one implication is that the NMBCLUSTERS compile-time option is no longer used. The maximum number of clusters is still capped off according to maxusers, but it can be made unlimited by setting the kern.ipc.nmbclusters boot-time tunable to zero. Work should be done to write an appropriate sysctl handler allowing dynamic tuning of kern.ipc.nmbclusters at runtime. Additional things worth noting/known issues (READ): - One report of 'ips' (ServeRAID) driver acting really slow in conjunction with mbuma. Need more data. Latest report is that ips is equally sucking with and without mbuma. - Giant leak in NFS code sometimes occurs, can't reproduce but currently analyzing; brueffer is able to reproduce but THIS IS NOT an mbuma-specific problem and currently occurs even WITHOUT mbuma. - Issues in network locking: there is at least one code path in the rip code where one or more locks are acquired and we end up in m_prepend() with M_WAITOK, which causes WITNESS to whine from within UMA. Current temporary solution: force all UMA allocations to be M_NOWAIT from within UMA for now to avoid deadlocks unless WITNESS is defined and we can determine with certainty that we're not holding any locks when we're M_WAITOK. - I've seen at least one weird socketbuffer empty-but- mbuf-still-attached panic. I don't believe this to be related to mbuma but please keep your eyes open, turn on debugging, and capture crash dumps. This change removes more code than it adds. A paper is available detailing the change and considering various performance issues, it was presented at BSDCan2004: http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf Please read the paper for Future Work and implementation details, as well as credits. Testing and Debugging: rwatson, brueffer, Ketrien I. Saihr-Kesenchedra, ... Reviewed by: Lots of people (for different parts)
# 128052	09-Apr-2004	rwatson	Compare pointers with NULL rather than using pointers are booleans in if/for statements. Assign pointers to NULL rather than typecast 0. Compare pointers with NULL rather than 0.
# 127911	05-Apr-2004	imp	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 127656	31-Mar-2004	rwatson	In sofree(), avoid nested declaration and initialization in declaration. Observe that initialization in declaration is frequently incompatible with locking, not just a bad idea due to style(9). Submitted by: bde
# 127577	29-Mar-2004	rwatson	Use a common return path for filt_soread() and filt_sowrite() to simplify the impact of locking on these functions. Submitted by: sam Sponsored by: FreeBSD Foundation
# 127575	29-Mar-2004	rwatson	In sofree(), moving caching of 'head' from 'so->so_head' to later in the function once it has been determined to be non-NULL to simplify locking on an earlier return.
# 126425	01-Mar-2004	rwatson	Rename dup_sockaddr() to sodupsockaddr() for consistency with other functions in kern_socket.c. Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT in from the caller context rather than "1" or "0". Correct mflags pass into mac_init_socket() from previous commit to not include M_ZERO. Submitted by: sam
# 126422	29-Feb-2004	scottl	Convert the other use of flags to mflags in soalloc().
# 126411	29-Feb-2004	rwatson	Modify soalloc() API so that it accepts a malloc flags argument rather than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT rather than 0 or 1. This simplifies the soalloc() logic, and also makes the waiting behavior of soalloc() more clear in the calling context. Submitted by: sam
# 125724	11-Feb-2004	green	Always socantsendmore() before deallocating a socket. This, in turn, calls selwakeup() if necessary (which it is, if you don't want freed memory hanging around on your td->td_selq). Props to: alfred
# 125264	31-Jan-2004	phk	Introduce the SO_BINTIME option which takes a high-resolution timestamp at packet arrival. For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL since it has higher resolution and lower overhead. Simultaneous use of the two options is possible and they will return consistent timestamps. This introduces an extra test and a function call for SO_TIMEVAL, but I have not been able to measure that.
# 124674	18-Jan-2004	ru	Since "m" is not part of the "mp" chain, need to free() it. Reported by: Stanford Metacompilation research group
# 122807	16-Nov-2003	rwatson	Reduce gratuitous redundancy and length in function names: mac_setsockopt_label_set() -> mac_setsockopt_label() mac_getsockopt_label_get() -> mac_getsockopt_label() mac_getsockopt_peerlabel_get() -> mac_getsockopt_peerlabel() Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 122775	16-Nov-2003	rwatson	When implementing getsockopt() for SO_LABEL and SO_PEERLABEL, make sure to sooptcopyin() the (struct mac) so that the MAC Framework knows which label types are being requested. This fixes process queries of socket labels. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 122352	09-Nov-2003	tanimura	- Implement selwakeuppri() which allows raising the priority of a thread being waken up. The thread waken up can run at a priority as high as after tsleep(). - Replace selwakeup()s with selwakeuppri()s and pass appropriate priorities. - Add cv_broadcastpri() which raises the priority of the broadcast threads. Used by selwakeuppri() if collision occurs. Not objected in: -arch, -current
# 121628	28-Oct-2003	sam	speedup stream socket recv handling by tracking the tail of the mbuf chain instead of walking the list for each append Submitted by: ps/jayanth Obtained from: netbsd (jason thorpe)
# 121307	21-Oct-2003	silby	Change all SYSCTLS which are readonly and have a related TUNABLE from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide more useful error messages.
# 118453	04-Aug-2003	hsu	Make the second argument to sooptcopyout() constant in order to simplify the upcoming PIM patches. Submitted by: Pavlin Radoslavov <pavlin@icir.org>
# 117708	17-Jul-2003	robert	To avoid a kernel panic provoked by a NULL pointer dereference, do not clear the `sb_sel' member of the sockbuf structure while invalidating the receive sockbuf in sorflush(), called from soshutdown(). The panic was reproduceable from user land by attaching a knote with EVFILT_READ filters to a socket, disabling further reads from it using shutdown(2), and then closing it. knote_remove() was called to remove all knotes from the socket file descriptor by detaching each using its associated filterops' detach call- back function, sordetach() in this case, which tried to remove itself from the invalidated sockbuf's klist (sb_sel.si_note). PR: kern/54331
# 117595	14-Jul-2003	hsu	Rev 1.121 meant to pass the value 1 to soalloc() to indicate waitok. Reported by: arr
# 116182	10-Jun-2003	obrien	Use __FBSDID().
# 114216	29-Apr-2003	kan	Deprecate machine/limits.h in favor of new sys/limits.h. Change all in-tree consumers to include <sys/limits.h> Discussed on: standards@ Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
# 113477	14-Apr-2003	cognet	Use while (controlp != NULL) instead of do ... while (control != NULL) There are valid cases where *controlp will be NULL at this point. Discussed with: dwmalone
# 111742	02-Mar-2003	des	Clean up whitespace, s/register //, refrain from strong urge to ANSIfy.
# 111741	02-Mar-2003	des	uiomove-related caddr_t -> void * (just the low-hanging fruit)
# 111161	20-Feb-2003	cognet	Remove duplicate includes. Submitted by: Cyril Nguyen-Huu <cyril@ci0.org>
# 111119	19-Feb-2003	imp	Back out M_* changes, per decision of the TRB. Approved by: trb
# 109623	21-Jan-2003	alfred	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 109439	17-Jan-2003	tmm	Disallow listen() on sockets which are in the SS_ISCONNECTED or SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates in this case). listen() on connected or connecting sockets would cause them to enter a bad state; in the TCP case, this could cause sockets to go catatonic or panics, depending on how the socket was connected. Reviewed by: -net MFC after: 2 weeks
# 109153	12-Jan-2003	dillon	Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
# 109123	11-Jan-2003	dillon	Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it. Change struct xfile xf_data to xun_data (ABI is still compatible). If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
# 108708	05-Jan-2003	alfred	In sodealloc(), if there is an accept filter present on the socket then call do_setopt_accept_filter(so, NULL) which will free the filter instead of duplicating the code in do_setopt_accept_filter(). Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>
# 108235	23-Dec-2002	phk	s/sokqfilter/soo_kqfilter/ for consistency with the naming of all other socket/file operations.
# 107309	27-Nov-2002	maxim	Small SO_RCVTIMEO and SO_SNDTIMEO values are mistakenly taken to be zero. PR: kern/32827 Submitted by: Hartmut Brandt <brandt@fokus.gmd.de> Approved by: re (jhb) MFC after: 2 weeks
# 106696	09-Nov-2002	alfred	Fix instances of macros with improperly parenthasized arguments. Verified by: md5
# 106472	05-Nov-2002	kbyanc	Fix filt_soread() to properly flag a kevent when a 0-byte datagram is received. Verified by: dougb, Manfred Antar <null@pozo.com> Sponsored by: NTT Multimedia Communications Labs
# 106326	02-Nov-2002	alc	Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing a panic because the socket's state isn't as expected by sofree(). Discussed with: dillon, fenner
# 106313	01-Nov-2002	kbyanc	Track the number of non-data chararacters stored in socket buffers so that the data value returned by kevent()'s EVFILT_READ filter on non-TCP sockets accurately reflects the amount of data that can be read from the sockets by applications. PR: 30634 Reviewed by: -net, -arch Sponsored by: NTT Multimedia Communications Labs MFC after: 2 weeks
# 106096	28-Oct-2002	rwatson	Trim extraneous #else and #endif MAC comments per style(9).
# 104541	05-Oct-2002	rwatson	Modify label allocation semantics for sockets: pass in soalloc's malloc flags so that we can call malloc with M_NOWAIT if necessary, avoiding potential sleeps while holding mutexes in the TCP syncache code. Similar to the existing support for mbuf label allocation: if we can't allocate all the necessary label store in each policy, we back out the label allocation and fail the socket creation. Sync from MAC tree. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 101983	16-Aug-2002	rwatson	Make similar changes to fo_stat() and fo_poll() as made earlier to fo_read() and fo_write(): explicitly use the cred argument to fo_poll() as "active_cred" using the passed file descriptor's f_cred reference to provide access to the file credential. Add an active_cred argument to fo_stat() so that implementers have access to the active credential as well as the file credential. Generally modify callers of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which was redundantly provided via the fp argument. This set of modifications also permits threads to perform these operations on behalf of another thread without modifying their credential. Trickle this change down into fo_stat/poll() implementations: - badfo_poll(), badfo_stat(): modify/add arguments. - kqueue_poll(), kqueue_stat(): modify arguments. - pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to MAC checks rather than td->td_ucred. - soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather than cred to pru_sopoll() to maintain current semantics. - sopoll(): moidfy arguments. - vn_poll(), vn_statfile(): modify/add arguments, pass new arguments to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL() to maintian current semantics. - vn_close(): rename cred to file_cred to reflect reality while I'm here. - vn_stat(): Add active_cred and file_cred arguments to vn_stat() and consumers so that this distinction is maintained at the VFS as well as 'struct file' layer. Pass active_cred instead of td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics. - fifofs: modify the creation of a "filetemp" so that the file credential is properly initialized and can be used in the socket code if desired. Pass ap->a_td->td_ucred as the active credential to soo_poll(). If we teach the vnop interface about the distinction between file and active credentials, we would use the active credential here. Note that current inconsistent passing of active_cred vs. file_cred to VOP's is maintained. It's not clear why GETATTR would be authorized using active_cred while POLL would be authorized using file_cred at the file system level. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 101746	12-Aug-2002	rwatson	Use the credential authorizing the socket creation operation to perform the jail check and the MAC socket labeling in socreate(). This handles socket creation using a cached credential better (such as in the NFS client code when rebuilding a socket following a disconnect: the new socket should be created using the nfsmount cached cred, not the cred of the thread causing the socket to be rebuilt). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 101173	01-Aug-2002	rwatson	Include file cleanup; mac.h and malloc.h at one point had ordering relationship requirements, and no longer do. Reminded by: bde
# 101134	01-Aug-2002	rwatson	Introduce support for Mandatory Access Control and extensible kernel access control. Implement two IOCTLs at the socket level to retrieve the primary and peer labels from a socket. Note that this user process interface will be changing to improve multi-policy support. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 101013	31-Jul-2002	rwatson	Introduce support for Mandatory Access Control and extensible kernel access control. Invoke the necessary MAC entry points to maintain labels on sockets. In particular, invoke entry points during socket allocation and destruction, as well as creation by a process or during an accept-scenario (sonewconn). For UNIX domain sockets, also assign a peer label. As the socket code isn't locked down yet, locking interactions are not yet clear. Various protocol stack socket operations (such as peer label assignment for IPv4) will follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 100605	24-Jul-2002	mike	Catch up to rev 1.87 of sys/sys/socketvar.h (sb_cc changed from u_long to u_int). Noticed by: sparc64 tinderbox
# 98998	28-Jun-2002	alfred	More caddr_t removal. Change struct knote's kn_hook from caddr_t to void *.
# 98849	26-Jun-2002	ken	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. (uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c:Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object allocate does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structre. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.
# 98499	20-Jun-2002	alfred	Implement SO_NOSIGPIPE option for sockets. This allows one to request that an EPIPE error return not generate SIGPIPE on sockets. Submitted by: lioux Inspired by: Darwin
# 97658	31-May-2002	tanimura	Back out my lats commit of locking down a socket, it conflicts with hsu's work. Requested by: hsu
# 97085	21-May-2002	arr	- td will never be NULL, so the call to soalloc() in socreate() will always be passed a 1; we can, however, use M_NOWAIT to indicate this. - Check so against NULL since it's a pointer to a structure.
# 97083	21-May-2002	arr	- OR the flag variable with M_ZERO so that the uma_zalloc() handles the zero'ing out of the allocated memory. Also removed the logical bzero that followed.
# 96972	20-May-2002	tanimura	Lock down a socket, milestone 1. o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket. o Determine the lock strategy for each members in struct socket. o Lock down the following members: - so_count - so_options - so_linger - so_state o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket: - sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup() Reviewed by: alfred
# 96122	06-May-2002	alfred	Make funsetown() take a 'struct sigio **' so that the locking can be done internally. Ensure that no one can fsetown() to a dying process/pgrp. We need to check the process for P_WEXIT to see if it's exiting. Process groups are already safe because there is no such thing as a pgrp zombie, therefore the proctree lock completely protects the pgrp from having sigio structures associated with it after it runs funsetownlst. Add sigio lock to witness list under proctree and allproc, but over proc and pgrp. Seigo Tanimura helped with this.
# 95883	01-May-2002	alfred	Redo the sigio locking. Turn the sigio sx into a mutex. Sigio lock is really only needed to protect interrupts from dereferencing the sigio pointer in an object when the sigio itself is being destroyed. In order to do this in the most unintrusive manner change pgsigio's sigio * argument into a **, that way we can lock internally to the function.
# 95478	26-Apr-2002	silby	Make sure that sockets undergoing accept filtering are aborted in a LRU fashion when the listen queue fills up. Previously, there was no mechanism to kick out old sockets, leading to an easy DoS of daemons using accept filtering. Reviewed by: alfred MFC after: 3 days
# 94160	08-Apr-2002	hsu	There's only one socket zone so we don't need to remember it in every socket structure.
# 92827	20-Mar-2002	jeff	UMA permited us to utilize the 'waitok' flag to soalloc.
# 92751	20-Mar-2002	jeff	Remove references to vm_zone.h and switch over to the new uma API. Also, remove maxsockets. If you look carefully you'll notice that the old zone allocator never honored this anyway.
# 92654	19-Mar-2002	jeff	This is the first part of the new kernel memory allocator. This replaces malloc(9) and vm_zone with a slab like allocator. Reviewed by: arch@
# 91482	28-Feb-2002	iedowse	In sosend(), enforce the socket buffer limits regardless of whether the data was supplied as a uio or an mbuf. Previously the limit was ignored for mbuf data, and NFS could run the kernel out of mbufs when an ipfw rule blocked retransmissions.
# 91406	27-Feb-2002	jhb	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# 90227	05-Feb-2002	dillon	Get rid of the twisted MFREE() macro entirely. Reviewed by: dg, bmilekic MFC after: 3 days
# 89376	14-Jan-2002	alfred	Fix select on fifos. Backout revision 1.56 and 1.57 of fifo_vnops.c. Introduce a new poll op "POLLINIGNEOF" that can be used to ignore EOF on a fifo, POLLIN/POLLRDNORM is converted to POLLINIGNEOF within the FIFO implementation to effect the correct behavior. This should allow one to view a fifo pretty much as a data source rather than worry about connections coming and going. Reviewed by: bde
# 88739	31-Dec-2001	rwatson	o Make the credential used by socreate() an explicit argument to socreate(), rather than getting it implicitly from the thread argument. o Make NFS cache the credential provided at mount-time, and use the cached credential (nfsmount->nm_cred) when making calls to socreate() on initially connecting, or reconnecting the socket. This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well as bugs involving NFS and mandatory access control implementations. Reviewed by: freebsd-arch
# 86487	17-Nov-2001	dillon	Give struct socket structures a ref counting interface similar to vnodes. This will hopefully serve as a base from which we can expand the MP code. We currently do not attempt to obtain any mutex or SX locks, but the door is open to add them when we nail down exactly how that part of it is going to work.
# 86306	12-Nov-2001	keramida	Remove EOL whitespace. Reviewed by: alfred
# 86305	12-Nov-2001	keramida	Make KASSERT's print the values that triggered a panic. Reviewed by: alfred
# 84827	11-Oct-2001	jhb	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.
# 84736	09-Oct-2001	rwatson	- Combine kern.ps_showallprocs and kern.ipc.showallsockets into a single kern.security.seeotheruids_permitted, describes as: "Unprivileged processes may see subjects/objects with different real uid" NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is an API change. kern.ipc.showallsockets does not. - Check kern.security.seeotheruids_permitted in cr_cansee(). - Replace visibility calls to socheckuid() with cr_cansee() (retain the change to socheckuid() in ipfw, where it is used for rule-matching). - Remove prison_unpcb() and make use of cr_cansee() against the UNIX domain socket credential instead of comparing root vnodes for the UDS and the process. This allows multiple jails to share the same chroot() and not see each others UNIX domain sockets. - Remove unused socheckproc(). Now that cr_cansee() is used universally for socket visibility, a variety of policies are more consistently enforced, including uid-based restrictions and jail-based restrictions. This also better-supports the introduction of additional MAC models. Reviewed by: ps, billf Obtained from: TrustedBSD Project
# 84527	05-Oct-2001	ps	Only allow users to see their own socket connections if kern.ipc.showallsockets is set to 0. Submitted by: billf (with modifications by me) Inspired by: Dave McKay (aka pm aka Packet Magnet) Reviewed by: peter MFC after: 2 weeks
# 84472	04-Oct-2001	dwmalone	Hopefully improve control message passing over Unix domain sockets. 1) Allow the sending of more than one control message at a time over a unix domain socket. This should cover the PR 29499. 2) This requires that unp_{ex,in}ternalize and unp_scan understand mbufs with more than one control message at a time. 3) Internalize and externalize used to work on the mbuf in-place. This made life quite complicated and the code for sizeof(int) < sizeof(file ) could end up doing the wrong thing. The patch always create a new mbuf/cluster now. This resulted in the change of the prototype for the domain externalise function. 4) You can now send SCM_TIMESTAMP messages. 5) Always use CMSG_DATA(cm) to determine the start where the data in unp_{ex,in}ternalize. It was using ((struct cmsghdr )cm + 1) in some places, which gives the wrong alignment on the alpha. (NetBSD made this fix some time ago). This results in an ABI change for discriptor passing and creds passing on the alpha. (Probably on the IA64 and Spare ports too). 6) Fix userland programs to use CMSG_* macros too. 7) Be more careful about freeing mbufs containing (file *)s. This is made possible by the prototype change of externalise. PR: 29499 MFC after: 6 weeks
# 83805	21-Sep-2001	jhb	Use the passed in thread to selrecord() instead of curthread.
# 83366	12-Sep-2001	julian	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 76166	01-May-2001	markm	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h form kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)
# 76075	27-Apr-2001	alfred	Actually show the values that tripped the assertion "receive 1"
# 74371	16-Mar-2001	jlemon	When doing a recv(.. MSG_WAITALL) for a message which is larger than the socket buffer size, the receive is done in sections. After completing a read, call pru_rcvd on the underlying protocol before blocking again. This allows the the protocol to take appropriate action, such as sending a TCP window update to the peer, if the window happened to close because the socket buffer was filled. If the protocol is not notified, a TCP transfer may stall until the remote end sends a window probe.
# 74018	09-Mar-2001	jlemon	Push the test for a disconnected socket when accept()ing down to the protocol layer. Not all protocols behave identically. This fixes the brokenness observed with unix-domain sockets (and postfix)
# 73153	27-Feb-2001	ru	In soshutdown(), use SHUT_{RD,WR,RDWR} instead of FREAD and FWRITE. Also, return EINVAL if `how' is invalid, as required by POSIX spec.
# 72968	23-Feb-2001	jlemon	Introduce a NOTE_LOWAT flag for use with the read/write filters, which allow the watermark to be passed in via the data field during the EV_ADD operation. Hook this up to the socket read/write filters; if specified, it overrides the so_{rcv\|snd}.sb_lowat values in the filter. Inspired by: "Ronald F. Guilmette" <rfg@monkeys.com>
# 72967	23-Feb-2001	jlemon	When returning EV_EOF for the socket read/write filters, also return the current socket error in fflags. This may be useful for determining why a connect() request fails. Inspired by: "Jonathan Graehl" <jonathan@graehl.org>
# 72786	21-Feb-2001	rwatson	o Move per-process jail pointer (p->pr_prison) to inside of the subject credential structure, ucred (cr->cr_prison). o Allow jail inheritence to be a function of credential inheritence. o Abstract prison structure reference counting behind pr_hold() and pr_free(), invoked by the similarly named credential reference management functions, removing this code from per-ABI fork/exit code. o Modify various jail() functions to use struct ucred arguments instead of struct proc arguments. o Introduce jailed() function to determine if a credential is jailed, rather than directly checking pointers all over the place. o Convert PRISON_CHECK() macro to prison_check() function. o Move jail() function prototypes to jail.h. o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the flag in the process flags field itself. o Eliminate that "const" qualifier from suser/p_can/etc to reflect mutex use. Notes: o Some further cleanup of the linux/jail code is still required. o It's now possible to consider resolving some of the process vs credential based permission checking confusion in the socket code. o Mutex protection of struct prison is still not present, and is required to protect the reference count plus some fields in the structure. Reviewed by: freebsd-arch Obtained from: TrustedBSD Project
# 72521	15-Feb-2001	jlemon	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter
# 72471	14-Feb-2001	jlemon	Return ECONNABORTED from accept if connection is closed while on the listen queue, as well as the current behavior of a zero-length sockaddr. Obtained from: KAME Reviewed by: -net
# 71350	21-Jan-2001	des	First step towards an MP-safe zone allocator: - have zalloc() and zfree() always lock the vm_zone. - remove zalloci() and zfreei(), which are now redundant. Reviewed by: bmilekic, jasone
# 70254	21-Dec-2000	bmilekic	* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT. This is because calls with M_WAIT (now M_TRYWAIT) may not wait forever when nothing is available for allocation, and may end up returning NULL. Hopefully we now communicate more of the right thing to developers and make it very clear that it's necessary to check whether calls with M_(TRY)WAIT also resulted in a failed allocation. M_TRYWAIT basically means "try harder, block if necessary, but don't necessarily wait forever." The time spent blocking is tunable with the kern.ipc.mbuf_wait sysctl. M_WAIT is now deprecated but still defined for the next little while. * Fix a typo in a comment in mbuf.h * Fix some code that was actually passing the mbuf subsystem's M_WAIT to malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the value of the M_WAIT flag, this could have became a big problem.
# 69781	08-Dec-2000	dwmalone	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 68924	19-Nov-2000	alfred	Accept filters broke kernels compiled without options INET. Make accept filters conditional on INET support to fix. Pointed out by: bde Tested and assisted by: Stephen J. Kiernan <sab@vegamuse.org>
# 66420	28-Sep-2000	jlemon	Check so_error in filt_so{read\|write} in order to detect UDP errors. PR: 21601
# 65495	05-Sep-2000	truckman	Remove uidinfo hash table lookup and maintenance out of chgproccnt() and chgsbsize(), which are called rather frequently and may be called from an interrupt context in the case of chgsbsize(). Instead, do the hash table lookup and maintenance when credentials are changed, which is a lot less frequent. Add pointers to the uidinfo structures to the ucred and pcred structures for fast access. Pass a pointer to the credential to chgproccnt() and chgsbsize() instead of passing the uid. Add a reference count to the uidinfo structure and use it to decide when to free the structure rather than freeing the structure when the resource consumption drops to zero. Move the resource tracking code from kern_proc.c to kern_resource.c. Move some duplicate code sequences in kern_prot.c to separate helper functions. Change KASSERTs in this code to unconditional tests and calls to panic().
# 65198	29-Aug-2000	green	Remove any possibility of hiwat-related race conditions by changing the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and a u_long target to set it to. The whole thing is splnet(). This fixes a problem that jdp has been able to provoke.
# 64349	07-Aug-2000	jlemon	Make the kqueue socket read filter honor the SO_RCVLOWAT value. Spotted by: "Steve M." <stevem@redlinenetworks.com>
# 63646	20-Jul-2000	alfred	only allow accept filter modifications on listening sockets Submitted by: ps
# 61976	22-Jun-2000	alfred	fix races in the uidinfo subsystem, several problems existed: 1) while allocating a uidinfo struct malloc is called with M_WAITOK, it's possible that while asleep another process by the same user could have woken up earlier and inserted an entry into the uid hash table. Having redundant entries causes inconsistancies that we can't handle. fix: do a non-waiting malloc, and if that fails then do a blocking malloc, after waking up check that no one else has inserted an entry for us already. 2) Because many checks for sbsize were done as "test then set" in a non atomic manner it was possible to exceed the limits put up via races. fix: instead of querying the count then setting, we just attempt to set the count and leave it up to the function to return success or failure. 3) The uidinfo code was inlining and repeating, lookups and insertions and deletions needed to be in their own functions for clarity. Reviewed by: green
# 61837	19-Jun-2000	alfred	return of the accept filter part II accept filters are now loadable as well as able to be compiled into the kernel. two accept filters are provided, one that returns sockets when data arrives the other when an http request is completed (doesn't work with 0.9 requests) Reviewed by: jmg
# 61799	18-Jun-2000	alfred	backout accept optimizations. Requested by: jmg, dcs, jdp, nate
# 61714	15-Jun-2000	alfred	add socketoptions DELAYACCEPT and HTTPACCEPT which will not allow an accept() until the incoming connection has either data waiting or what looks like a HTTP request header already in the socketbuffer. This ought to reduce the context switch time and overhead for processing requests. The initial idea and code for HTTPACCEPT came from Yahoo engineers and has been cleaned up and a more lightweight DELAYACCEPT for non-http servers has been added Reviewed by: silence on hackers.
# 61633	13-Jun-2000	asmodai	Fix panic by moving the prp == 0 check up the order of sanity checks. Submitted by: Bart Thate <freebsd@1st.dudi.org> on -current Approved by: rwatson
# 61235	04-Jun-2000	rwatson	o Modify jail to limit creation of sockets to UNIX domain sockets, TCP/IP (v4) sockets, and routing sockets. Previously, interaction with IPv6 was not well-defined, and might be inappropriate for some environments. Similarly, sysctl MIB entries providing interface information also give out only addresses from those protocol domains. For the time being, this functionality is enabled by default, and toggleable using the sysctl variable jail.socket_unixiproute_only. In the future, protocol domains will be able to determine whether or not they are ``jail aware''. o Further limitations on process use of getpriority() and setpriority() by jailed processes. Addresses problem described in kern/17878. Reviewed by: phk, jmg
# 60938	26-May-2000	jake	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 60833	23-May-2000	jake	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# 59288	16-Apr-2000	jlemon	Introduce kqueue() and kevent(), a kernel event notification facility.
# 58225	18-Mar-2000	fenner	Make sure to free the socket in soabort() if the protocol couldn't free it (this could happen if the protocol already freed its part and we just kept the socket around to make sure accept(2) didn't block)
# 55943	14-Jan-2000	jasone	Add aio_waitcomplete(). Make aio work correctly for socket descriptors. Make gratuitous style(9) fixes (me, not the submitter) to make the aio code more readable. PR: kern/12053 Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>
# 55126	27-Dec-1999	green	Correct an uninitialized variable use, which, unlike most times, is actually a bug this time. Submitted by: bde Reviewed by: bde
# 54478	12-Dec-1999	green	This is Bosko Milekic's mbuf allocation waiting code. Basically, this means that running out of mbuf space isn't a panic anymore, and code which runs out of network memory will sleep to wait for it. Submitted by: Bosko Milekic <bmilekic@dsuper.net> Reviewed by: green, wollman
# 53541	22-Nov-1999	shin	KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP for IPv6 yet) With this patch, you can assigne IPv6 addr automatically, and can reply to IPv6 ping. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project
# 53212	16-Nov-1999	phk	This is a partial commit of the patch from PR 14914: Alot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. This batch of changes compile to the same object files. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# 52070	09-Oct-1999	green	Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total usage limit.
# 51381	19-Sep-1999	green	Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually. Make a sonewconn3() which takes an extra argument (proc) so new sockets created with sonewconn() from a user's system call get the correct credentials, not just the parent's credentials.
# 50477	27-Aug-1999	peter	$Id$ -> $FreeBSD$
# 47992	17-Jun-1999	green	Reviewed by: the cast of thousands This is the change to struct sockets that gets rid of so_uid and replaces it with a much more useful struct pcred *so_cred. This is here to be able to do socket-level credential checks (i.e. IPFW uid/gid support, to be added to HEAD soon). Along with this comes an update to pidentd which greatly simplifies the code necessary to get a uid from a socket. Soon to come: a sysctl() interface to finding individual sockets' credentials.
# 47720	04-Jun-1999	peter	Plug a mbuf leak in tcp_usr_send(). pru_send() routines are expected to either enqueue or free their mbuf chains, but tcp_usr_send() was dropping them on the floor if the tcpcb/inpcb has been torn down in the middle of a send/write attempt. This has been responsible for a wide variety of mbuf leak patterns, ranging from slow gradual leakage to rather rapid exhaustion. This has been a problem since before 2.2 was branched and appears to have been fixed in rev 1.16 and lost in 1.23/1.28. Thanks to Jayanth Vijayaraghavan <jayanth@yahoo-inc.com> for checking (extensively) into this on a live production 2.2.x system and that it was the actual cause of the leak and looks like it fixes it. The machine in question was loosing (from memory) about 150 mbufs per hour under load and a change similar to this stopped it. (Don't blame Jayanth for this patch though) An alternative approach to this would be to recheck SS_CANTSENDMORE etc inside the splnet() right before calling pru_send() after all the potential sleeps, interrupts and delays have happened. However, this would mean exposing knowledge of the tcp stack's reset handling and removal of the pcb to the generic code. There are other things that call pru_send() directly though. Problem originally noted by: John Plevyak <jplevyak@inktomi.com>
# 47364	21-May-1999	ache	Realy fix overflow on SO_*TIMEO Submitted by: bde
# 46381	03-May-1999	billf	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 46014	24-Apr-1999	ache	Lite2 bugfixes merge: so_linger is in seconds, not in 1/HZ range checking in SO_*TIMEO was wrong PR: 11252
# 44078	16-Feb-1999	dfr	* Change sysctl from using linker_set to construct its tree using SLISTs. This makes it possible to change the sysctl tree at runtime. * Change KLD to find and register any sysctl nodes contained in the loaded file and to unregister them when the file is unloaded. Reviewed by: Archie Cobbs <archie@whistle.com>, Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
# 43523	02-Feb-1999	fenner	Fix the port of the NetBSD 19990120-accept fix. I misread a piece of code when examining their fix, which caused my code (in rev 1.52) to: - panic("soaccept: !NOFDREF") - fatal trap 12, with tracebacks going thru soclose and soaccept
# 43301	27-Jan-1999	dillon	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 43196	25-Jan-1999	fenner	Port NetBSD's 19990120-accept bug fix. This works around the race condition where select(2) can return that a listening socket has a connected socket queued, the connection is broken, and the user calls accept(2), which then blocks because there are no connections queued. Reviewed by: wollman Obtained from: NetBSD (ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)
# 42903	20-Jan-1999	fenner	Also consider the space left in the socket buffer when deciding whether to set PRUS_MORETOCOME.
# 42902	20-Jan-1999	fenner	Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This flag means that there is more data to be put into the socket buffer. Use it in TCP to reduce the interaction between mbuf sizes and the Nagle algorithm. Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's fix for this problem.
# 42453	09-Jan-1999	eivind	KNFize, by bde.
# 42408	08-Jan-1999	eivind	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith
# 41591	07-Dec-1998	archie	The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static and local variables, goto labels, and functions declared but not defined.
# 41086	11-Nov-1998	truckman	Installed the second patch attached to kern/7899 with some changes suggested by bde, a few other tweaks to get the patch to apply cleanly again and some improvements to the comments. This change closes some fairly minor security holes associated with F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN had on tty devices. For more details, see the description on the PR. Because this patch increases the size of the proc and pgrp structures, it is necessary to re-install the includes and recompile libkvm, the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w. PR: kern/7899 Reviewed by: bde, elvind
# 38705	31-Aug-1998	wollman	Bow to tradition and correctly implement the bogus-but-hallowed semantics of getsockopt never telling how much it might have copied if only the buffer were big enough.
# 38699	31-Aug-1998	wollman	Correctly set the return length regardless of the relative size of the user's buffer. Simplify the logic a bit. (Can we have a version of min() for size_t?)
# 38482	23-Aug-1998	wollman	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.
# 37740	18-Jul-1998	fenner	Undo rev 1.41 until we get more details about why it makes some systems fail.
# 37444	06-Jul-1998	fenner	Introduce (fairly hacky) workaround for odd TCP behavior with application writes of size (100,208]+N*MCLBYTES. The bug: sosend() hands each mbuf off to the protocol output routine as soon as it has copied it, in the hopes of increasing parallelism (see http://www.kohala.com/~rstevens/vanj.88jul20.txt ). This works well for TCP as long as the first mbuf handed off is at least the MSS. However, when doing small writes (between MHLEN and MINCLSIZE), the transaction is split into 2 small MBUF's and each is individually handed off to TCP. TCP assumes that the first small mbuf is the whole transaction, so sends a small packet. When the second small mbuf arrives, Nagle prevents TCP from sending it so it must wait for a (potentially delayed) ACK. This sends throughput down the toilet. The workaround: Set the "atomic" flag when we're doing small writes. The "atomic" flag has two meanings: 1. Copy all of the data into a chain of mbufs before handing off to the protocol. 2. Leave room for a datagram header in said mbuf chain. TCP wants the first but doesn't want the second. However, the second simply results in some memory wastage (but is why the workaround is a hack and not a fix). The real fix: The real fix for this problem is to introduce something like a "requested transfer size" variable in the socket->protocol interface. sosend() would then accumulate an mbuf chain until it exceeded the "requested transfer size". TCP could set it to the TCP MSS (note that the current interface causes strange TCP behaviors when the MSS > MCLBYTES; nobody notices because MCLBYTES > ethernet's MTU).
# 36079	15-May-1998	wollman	Convert socket structures to be type-stable and add a version number. Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs. Convert PF_LOCAL PCB structures to be type-stable and add a version number. Define an external format for infomation about socket structures and use it in several places. Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines. It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
# 34924	28-Mar-1998	bde	Moved some #includes from <sys/param.h> nearer to where they are actually used.
# 33955	01-Mar-1998	guido	Make sure that you can only bind a more specific address when it is done by the same uid. Obtained from: OpenBSD
# 33628	19-Feb-1998	fenner	Revert sosend() to its behavior from 4.3-Tahoe and before: if so_error is set, clear it before returning it. The behavior introduced in 4.3-Reno (to not clear so_error) causes potentially transient errors (e.g. ECONNREFUSED if the other end hasn't opened its socket yet) to be permanent on connected datagram sockets that are only used for writing. (soreceive() clears so_error before returning it, as does getsockopt(...,SO_ERROR,...).) Submitted by: Van Jacobson <van@ee.lbl.gov>, via a comment in the vat sources.
# 33134	06-Feb-1998	eivind	Back out DIAGNOSTIC changes.
# 33108	04-Feb-1998	eivind	Turn DIAGNOSTIC into a new-style option.
# 31053	09-Nov-1997	jkh	MF22: MSG_EOR bug fix. Submitted by: wollman
# 30354	12-Oct-1997	phk	Last major round (Unless Bruce thinks of somthing :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
# 30108	04-Oct-1997	phk	While booting diskless we have no proc pointer.
# 29352	14-Sep-1997	peter	Extend select backend for sockets to work with a poll interface (more detail is passed back and forwards). This mostly came from NetBSD, except that our interfaces have changed a lot and this funciton is in a different part of the kernel. Obtained from: NetBSD
# 29041	02-Sep-1997	bde	Removed unused #includes.
# 28551	21-Aug-1997	bde	#include <machine/limits.h> explicitly in the few places that it is required.
# 28270	16-Aug-1997	wollman	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 26990	27-Jun-1997	peter	Don't accept insane values for SO_(SND\|RCV)BUF, and the low water marks. Specifically, don't allow a value < 1 for any of them (it doesn't make sense), and don't let the low water mark be greater than the corresponding high water mark. Pre-Approved by: wollman Obtained from: NetBSD
# 25201	27-Apr-1997	wollman	The long-awaited mega-massive-network-code- cleanup. Part I. This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in. 2) Certain protocol entry points are modified to take a process structure, so they they can easily tell whether or not it is possible to sleep, and also to access credentials. 3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed. 4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family. 5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem. As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
# 24131	23-Mar-1997	bde	Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
# 23081	24-Feb-1997	wollman	Create a new branch of the kernel MIB, kern.ipc, to store all of the configurables and instrumentation related to inter-process communication mechanisms. Some variables, like mbuf statistics, are instrumented here for the first time. For mbuf statistics: also keep track of m_copym() and m_pullup() failures, and provide for the user's inspection the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.
# 22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 20030	29-Nov-1996	dg	Check for error return from uiomove to prevent looping endlessly in soreceive(). Closes PR#2114. Submitted by: wpaul
# 18787	07-Oct-1996	pst	Increase robustness of FreeBSD against high-rate connection attempt denial of service attacks. Reviewed by: bde,wollman,olah Inspired by: vjs@sgi.com
# 17096	11-Jul-1996	wollman	Modify the kernel to use the new pr_usrreqs interface rather than the old pr_usrreq mechanism which was poorly designed and error-prone. This commit renames pr_usrreq to pr_ousrreq so that old code which depended on it would break in an obvious manner. This commit also implements the new interface for TCP, although the old function is left as an example (#ifdef'ed out). This commit ALSO fixes a longstanding bug in the TCP timer processing (introduced by davidg on 1995/04/12) which caused timer processing on a TCB to always stop after a single timer had expired (because it misinterpreted the return value from tcp_usrreq() to indicate that the TCB had been deleted). Finally, some code related to polling has been deleted from if.c because it is not relevant t -current and doesn't look at all like my current code.
# 15701	09-May-1996	wollman	Make it possible to return more than one piece of control information (PR #1178). Define a new SO_TIMESTAMP socket option for datagram sockets to return packet-arrival timestamps as control information (PR #1179). Submitted by: Louis Mamakos <loiue@TransSys.com>
# 15269	16-Apr-1996	dg	Fix for PR #1146: the "next" pointer must be cached before calling soabort since the struct containing it may be freed.
# 14547	11-Mar-1996	dg	Changed socket code to use 4.4BSD queue macros. This includes removing the obsolete soqinsque and soqremque functions as well as collapsing so_q0len and so_qlen into a single queue length of unaccepted connections. Now the queue of unaccepted & complete connections is checked directly for queued sockets. The new code should be functionally equivilent to the old while being substantially faster - especially in cases where large numbers of connections are often queued for accept (e.g. http).
# 14093	13-Feb-1996	wollman	Kill XNS. While we're at it, fix socreate() to take a process argument. (This was supposed to get committed days ago...)
# 13955	07-Feb-1996	wollman	Define a new socket option, SO_PRIVSTATE. Getting it returns the state of the SS_PRIV flag in so_state; setting it always clears same.
# 12843	14-Dec-1995	bde	Nuked ambiguous sleep message strings: old: new: netcls[] = "netcls" "soclos" netcon[] = "netcon" "accept", "connec" netio[] = "netio" "sblock", "sbwait"
# 12041	03-Nov-1995	wollman	Make somaxconn (maximum backlog in a listen(2) request) and sb_max (maximum size of a socket buffer) tunable. Permit callers of listen(2) to specify a negative backlog, which is translated into somaxconn. Previously, a negative backlog was silently translated into 0.
# 10273	25-Aug-1995	bde	Remove extra arg from one of the calls to (*pr_usrreq)().
# 8876	30-May-1995	rgrimes	Remove trailing whitespace.
# 6476	15-Feb-1995	wollman	getsockopt(s, SOL_SOCKET, SO_SNDTIMEO, ...) would construct the returned timeval incorrectly, truncating the usec part. Obtained from: Stevens vol. 2 p. 548
# 6222	07-Feb-1995	wollman	Merge in the socket-level support for Transaction TCP.
# 6211	06-Feb-1995	dg	Use M_NOWAIT instead of M_KERNEL for socket allocations; it is apparantly possible for certain socket operations to occur during interrupt context. Submitted by: John Dyson
# 6127	02-Feb-1995	dg	Calling semantics for kmem_malloc() have been changed...and the third argument is now more than just a single flag. (kern_malloc.c) Used new M_KERNEL value for socket allocations that previous were "M_NOWAIT". Note that this will change when we clean up the M_ namespace mess. Submitted by: John Dyson
# 3308	02-Oct-1994	phk	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
# 1817	02-Aug-1994	dg	Added $Id$
# 1623	29-May-1994	dg	Changed mbuf allocation policy to get a cluster if size > MINCLSIZE. Makes a BIG difference in socket performance.
# 1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# 1542	24-May-1994	rgrimes	This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
# 1541	24-May-1994	rgrimes	BSD 4.4 Lite Kernel Sources