History log of /freebsd-10.3-release/sys/kern/uipc_sockbuf.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 296373 04-Mar-2016 marius

- Copy stable/10@296371 to releng/10.3 in preparation for 10.3-RC1
builds.
- Update newvers.sh to reflect RC1.
- Update __FreeBSD_version to reflect 10.3.
- Update default pkg(8) configuration to use the quarterly branch.

Approved by: re (implicit)

# 274043 03-Nov-2014 hselasky

MFC r271946 and r272595:
Improve transmit sending offload, TSO, algorithm in general. This
change allows all HCAs from Mellanox Technologies to function properly
when TSO is enabled. See r271946 and r272595 for more details about
this commit.

Sponsored by: Mellanox Technologies


# 268432 08-Jul-2014 delphij

Fix kernel memory disclosure in control message and SCTP notifications.

Security: FreeBSD-SA-14:17.kmem
Security: CVE-2014-3952, CVE-2014-3953


# 263820 27-Mar-2014 asomers

MFC r262867

Fix PR kern/185813 "SOCK_SEQPACKET AF_UNIX sockets with asymmetrical buffers
drop packets". It was caused by a check for the space available in a
sockbuf, but it was checking the wrong sockbuf.

sys/sys/sockbuf.h
sys/kern/uipc_sockbuf.c
Add sbappendaddr_nospacecheck_locked(), which is just like
sbappendaddr_locked but doesn't validate the receiving socket's space.
Factor out common code into sbappendaddr_locked_internal(). We
shouldn't simply make sbappendaddr_locked check the space and then call
sbappendaddr_nospacecheck_locked, because that would cause the O(n)
function m_length to be called twice.

sys/kern/uipc_usrreq.c
Use sbappendaddr_nospacecheck_locked for SOCK_SEQPACKET sockets,
because the receiving sockbuf's size limit is irrelevant.

tests/sys/kern/unix_seqpacket_test.c
Now that 185813 is fixed, pipe_128k_8k fails intermittently due to
185812. Make it fail every time by adding a usleep after starting the
writer thread and before starting the reader thread in test_pipe. That
gives the writer time to fill up its send buffer. Also, clear the
expected failure message due to 185813. It actually said "185812", but
that was a typo.

PR: kern/185813


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 256185 09-Oct-2013 glebius

- Substitute sbdrop_internal() with sbcut_internal(). The latter doesn't free
mbufs, but return chain of free mbufs to a caller. Caller can either reuse
them or return to allocator in a batch manner.
- Implement sbdrop()/sbdrop_locked() as a wrapper around sbcut_internal().
- Expose sbcut_locked() for outside usage.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Approved by: re (marius)


# 255138 01-Sep-2013 davide

Fix socket buffer timeouts precision using the new sbintime_t KPI instead
of relying on the tvtohz() workaround. The latter has been introduced
lately by jhb@ (r254699) in order to have a fix that can be backported
to STABLE.

Reported by: Vitja Makarov <vitja.makarov at gmail dot com>
Reviewed by: jhb (earlier version)


# 251984 19-Jun-2013 lstewart

When a previous call to sbsndptr() leaves sb->sb_sndptroff at the start of an
mbuf that was fully consumed by the previous call, the mbuf ptr returned by the
current call ends up being the previous mbuf in the sb chain to the one that
contains the data we want.

This does not cause any observable issues because the mbuf copy routines happily
walk the mbuf chain to get to the data at the moff offset, which in this case
means they effectively skip over the mbuf returned by sbsndptr().

We can't adjust sb->sb_sndptr during the previous call for this case because the
next mbuf in the chain may not exist yet. We therefore need to detect the
condition and make the adjustment during the current call.

Fix by detecting the special case of moff being at the start of the next mbuf in
the chain and adjust the required accounting variables accordingly.

Reviewed by: andre
MFC after: 2 weeks


# 248886 29-Mar-2013 glebius

Once ng_ksocket(4) is fixed, re-apply r194662. See this revision for
longer description.

Discussed with: andre, rwatson
Sponsored by: Nginx, Inc.


# 248318 15-Mar-2013 glebius

Use m_get() and m_getcl() instead of compat macros.


# 243882 05-Dec-2012 glebius

Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually


# 228449 13-Dec-2011 eadler

Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR: kern/155491
PR: kern/155490
PR: kern/155489
Submitted by: Galimov Albert <wtfcrap@mail.ru>
Approved by: bde
Reviewed by: jhb
MFC after: 1 week


# 225169 25-Aug-2011 bz

Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults. They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption. All values are still tunable
by sysctls.

Suggested by: gnn
Discussed on: arch (Mar and Aug 2011)
MFC after: 3 weeks
Approved by: re (kib)


# 220622 14-Apr-2011 glebius

Revert r194662, since it breaks ng_ksocket(4) and may break
other socket consumers with alternate sb_upcall.

PR: kern/154676
Submitted by: Arnaud Lacombe <lacombar gmail.com>
MFC after: 7 days


# 194662 22-Jun-2009 andre

In sbappendstream_locked() demote all incoming packet mbufs (and
chains) to pure data mbufs using m_demote(). This removes the
packet header and all m_tag information as they are not meaningful
anymore on a stream socket where mbufs are linked through m->m_next.
Strictly speaking a packet header can be only ever valid on the first
mbuf in an m_next chain.

sbcompress() was doing this already when the mbuf chain layout lent
itself to it (e.g. header splitting or merge-append), just not
consistently.

This frees resources at socket buffer append time instead of at
sbdrop_internal() time after data has been read from the socket.

For MAC the per packet information has done its duty and during
socket buffer appending the policy of the socket itself takes over.
With the append the packet boundaries disappear naturally and with
it any context that was based on it. None of the residual information
from mbuf headers in the socket buffer on stream sockets was looked at.


# 193272 01-Jun-2009 jhb

Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem


# 191366 21-Apr-2009 emax

Fix sbappendrecord_locked().

The main problem is that sbappendrecord_locked() relies on sbcompress()
to set sb_mbtail. This will not happen if sbappendrecord_locked() is
called with mbuf chain made of exactly one mbuf (i.e. m0->m_next == NULL).
In this case sbcompress() will be called with m == NULL and will do
nothing. I'm not entirely sure if m == NULL is a valid argument for
sbcompress(), and, it rather pointless to call it like that, but keep
calling it so it can do SBLASTMBUFCHK().

The problem is triggered by the SOCKBUF_DEBUG kernel option that
enables SBLASTRECORDCHK() and SBLASTMBUFCHK() checks.

PR: kern/126742
Investigated by: pluknet < pluknet -at- gmail -dot- com >
No response from: freebsd-current@, freebsd-bluetooth@
MFC after: 3 days


# 183663 07-Oct-2008 rwatson

Rewrite sbreserve_locked()'s comment on NULL thread pointers, eliminating
an XXXRW about the comment being stale.

MFC after: 3 days


# 182842 07-Sep-2008 bz

Catch a possible NULL pointer deref in case the offsets got mangled
somehow.
As a consequence we may now get an unexpected result(*).
Catch that error cases with a well defined panic giving appropriate
pointers to ease debugging.

(*) While the concensus was that the case should never happen unless
there was a bug, noone was definitively sure.

Discussed with: kmacy (about 8 months back)
Reviewed by: silby (as part of a larger patch in March)
MFC after: 2 months


# 179027 15-May-2008 gnn

Update the kernel to count the number of mbufs and clusters
(all types) used per socket buffer.

Add support to netstat to print out all of the socket buffer
statistics.

Update the netstat manual page to describe the new -x flag
which gives the extended output.

Reviewed by: rwatson, julian


# 175968 04-Feb-2008 rwatson

Further clean up sorflush:

- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internel() in sorflush(), and as a result avoid intializing
and destroying a socket buffer lock for the temporary stack copy of the
actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
things have gotten less ugly in sorflush() lately.

This makes socket close cleaner, and possibly also marginally faster.

MFC after: 3 weeks


# 175845 31-Jan-2008 rwatson

Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>


# 174711 17-Dec-2007 kmacy

Add SB_NOCOALESCE flag to disable socket buffer update in place


# 174647 16-Dec-2007 jeff

Refactor select to reduce contention and hide internal implementation
details from consumers.

- Track individual selecters on a per-descriptor basis such that there
are no longer collisions and after sleeping for events only those
descriptors which triggered events must be rescaned.
- Protect the selinfo (per descriptor) structure with a mtx pool mutex.
mtx pool mutexes were chosen to preserve api compatibility with
existing code which does nothing but bzero() to setup selinfo
structures.
- Use a per-thread wait channel rather than a global wait channel.
- Hide select implementation details in a seltd structure which is
opaque to the rest of the kernel.
- Provide a 'selsocket' interface for those kernel consumers who wish to
select on a socket when they have no fd so they no longer have to
be aware of select implementation details.

Tested by: kris
Reviewed on: arch


# 172557 12-Oct-2007 mohans

Set the NFS server sockbuf high watermarks to the system defaults
(up form 32KB). The low highwatermark setting caused UDP fullsock
request drops, throttling thruput greatly.
Reported by: Kris Kennaway
Approved by: re@ (Ken Smith)


# 170151 31-May-2007 rwatson

Now that sx(9) locks support an interruptible lock acquire primitive,
properly observe the SB_NOINTR flag in sblock. This restores the
required behavior that lock acquisition be interruptible on the socket
buffer I/O serialization lock to allow threads waiting for I/O to be
signaled even if they aren't the thread currently holding the I/O lock.
With this change, the sblock regression test is again passed.

Reported by: alfred
sx(9) handiwork: attilio


# 169624 16-May-2007 rwatson

Generally migrate to ANSI function headers, and remove 'register' use.


# 169236 03-May-2007 rwatson

sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags
on each socket buffer with the socket buffer's mutex. This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.

This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.

While here, fix two historic bugs:

(1) a bug allowing I/O to be occasionally interlaced during long I/O
operations (discovere by Isilon).

(2) a bug in which failed non-blocking acquisition of the socket buffer
I/O serialization lock might be ignored (discovered by sam).

SCTP portion of this patch submitted by rrs.


# 167902 26-Mar-2007 rwatson

Following movement of functions from uipc_socket2.c to uipc_socket.c and
uipc_sockbuf.c, clean up and update comments.


# 167895 26-Mar-2007 rwatson

Complete removal of uipc_socket2.c by moving the last few functions to
other C files:

- Move sbcreatecontrol() and sbtoxsockbuf() to uipc_sockbuf.c. While
sbcreatecontrol() is really an mbuf allocation routine, it does its work
with awareness of the layout of socket buffer memory.

- Move pru_*() protocol switch stubs to uipc_socket.c where the non-stub
versions of several of these functions live. Likewise, move socket state
transition calls (soisconnecting(), etc) to uipc_socket.c. Moveo
sodupsockaddr() and sotoxsocket().


# 167715 19-Mar-2007 andre

Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream
sockets.

Adjust tcp_output() make use of it.

Tested by: gallatin


# 162086 06-Sep-2006 jhb

Use sysctl_handle_long() instead of duplicating it's logic for
kern.ipc.maxsockbuf so that this sysctl works for 32-bit binaries running
on amd64 via compat/freebsd32.

MFC after: 3 days


# 160915 02-Aug-2006 rwatson

Remove 'register'.
Use ANSI C prototypes/function headers.
More deterministically line wrap comments.


# 160875 01-Aug-2006 rwatson

Reimplement socket buffer tear-down in sofree(): as the socket is no
longer referenced by other threads (hence our freeing it), we don't need
to set the can't send and can't receive flags, wake up the consumers,
perform two levels of locking, etc. Implement a fast-path teardown,
sbdestroy(), which flushes and releases each socket buffer. A manual
dom_dispose of the receive buffer is still required explicitly to GC
any in-flight file descriptors, etc, before flushing the buffer.

This results in a 9% UP performance improvement and 16% SMP performance
improvement on a tight loop of socket();close(); in micro-benchmarking,
but will likely also affect CPU-bound macro-benchmark performance.


# 160621 24-Jul-2006 rwatson

Remove non-socket buffer routines from uipc_sockbuf.c, and socket buffer
specific routines from uipc_socket2.c following repo-copy. We might
rethink the location of one or two at some point, but the division was
relatively clean. uipc_sockbuf.c is now the home of routines that
manipulate socket buffers.


# 160281 11-Jul-2006 rwatson

Several protocol switch functions (pru_abort, pru_detach, pru_sosetlabel)
return void, so don't implement no-op versions of these functions.
Instead, consistently check if those switch pointers are NULL before
invoking them.


# 160135 06-Jul-2006 rwatson

Remove now unneeded opt_mac.h and mac.h includes.


# 159706 17-Jun-2006 rwatson

Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have
basically always violated) invariannts of soreceive(), which assume
that the first mbuf pointer in a receive socket buffer can't change
while the SB_LOCK sleepable lock is held on the socket buffer,
which is precisely what these functions do. No current protocols
invoke these functions, and removing them will help discourage them
from ever being used. I should have removed them years ago, but
lost track of it.

MFC after: 1 week
Prodded almost by accident by: peter


# 159481 10-Jun-2006 rwatson

Move some functions and definitions from uipc_socket2.c to uipc_socket.c:

- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in
uipc_socket.c.

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket
implementation.

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after: 1 month


# 157927 21-Apr-2006 ps

Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.


# 157370 01-Apr-2006 rwatson

Chance protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.

MFC after: 3 months


# 157366 01-Apr-2006 rwatson

Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful. Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit. This will be corrected shortly in followup
commits to these components.

MFC after: 3 months


# 157158 26-Mar-2006 rwatson

Add a sysctl, regression.sonewconn_earlytest, which when options
REGRESSION is enabled, allows user space to dictate that sonewconn()
should skip it's "skip the hard work" check to see if the listen
queue is full, and instead proceed with allocation of a socket and
trimming of the overflowed queue. This makes it easier to test the
queue overflow logic.

MFC after: 1 month


# 156763 16-Mar-2006 rwatson

Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.


# 152672 22-Nov-2005 jdp

Fix a bug in the loop in sonewconn that makes room on the incomplete
connection queue for a new connection. It was removing connections
from the wrong list.

Submitted by: Paul Mikesell
Sponsored by: Isilon Systems
MFC after: 1 week


# 151967 02-Nov-2005 andre

Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005


# 151888 30-Oct-2005 rwatson

Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.


# 150280 18-Sep-2005 rwatson

Re-comment sbcompress() to explain what it is it does; it took me
quite a bit of reading to figure it out, and I want to avoid figuring
it out again.

Convert an if (foo) else printf("this is almost a panic") into a
KASSERT.

MFC after: 3 days


# 147730 01-Jul-2005 ssouhlal

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# 146688 27-May-2005 rwatson

In the current world order, each socket has two mutexes: a mutex
that protects socket and receive socket buffer state, and a second
mutex to protect send socket buffer state. In some places, the
mutex shared between the socket and receive socket buffer will be
acquired twice, once by each layer, resulting in some
inconsistency, but providing the abstraction benefit of being able
to more easily separate the two mutexes in the future if desired.

When transitioning a socket to the SS_ISDISCONNECTING or
SS_ISDISCONNECTED states, grab the socket/receive socket buffer lock
once rather than grabbing it as the socket lock, modifying socket
state, then grabbing a second time as the receive lock in order to
modify the socket buffer state to indicate no further data can be
read. This change is believed to close a race between the change in
socket state and the change in socket buffer state, which for a
remotely initiated close on a UNIX domain socket, resulted in
soreceive() returning ENOTCONN rather than an EOF condition.

A similar race still exists in the case of send, however, and is
harder to fix as the socket and send socket buffer mutexes are not
the same, and we would like to avoid holding combinations of socket
mutexes over sb_upcall until we've finished clarifying the locking
protocol for upcalls.

This change has the side affect of reducing the number of mutex
operations to initiate disconnect or perform disconnect on a
socket by two.

PR: 78824
Rerported by: Marc Olzheim <marcolz@stack.nl>
MFC after: 2 weeks


# 143465 12-Mar-2005 rwatson

Extend the coverage of the accept and socket mutexes in soisconnected()
so that the socket lock is held over the test-and-set removal of the
accept filter option during connect, and the two socket mutex regions
(transition to connected, perform accept filter) are combined.


# 143245 07-Mar-2005 rwatson

When upcalling from a socket in soisconnected() for an accept filter,
call with flag M_DONTWAIT rather than M_TRYWAIT, as we don't want to
do blocking memory allocation (etc) in the netisr.

MFC after: 3 days


# 142128 20-Feb-2005 rwatson

Prefer NULL to returning 0 cast to a pointer type.

MFC after: 3 days


# 142015 17-Feb-2005 rwatson

In sonewconn(), set the new socket's state to show the protocol-provided
connection status before inserting the new socket into the listen
socket's accept queue, or there might be a race in which another thread
wakes up when the accept lock is released, and sees the socket before its
state is set correctly. The wakeup still occurs after the accept lock is
released. There have been no diagnoses of this bug in real-world systems
(as yet).

MFC after: 3 days


# 139804 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 139217 23-Dec-2004 rwatson

In sonewconn(), the s/if/while/ change to wait for room at the tail of
the accept queue is a feature, not a bug/issue, so remove the XXXRW
from the comment.


# 136987 27-Oct-2004 maxim

Fix a typo in a comparison appeared in rev. 1.125.

Submitted by: JINMEI Tatuya


# 136692 19-Oct-2004 andre

Support for dynamically loadable and unloadable protocols within existing protocol
families.

The protosw[] array of any particular protocol family ("domain") is of fixed size
defined at compile time. This made it impossible to dynamically add or remove any
protocols to or from it. We work around this by introducing so called SPACER's
which are embedded into the protosw[] array at compile time. The SPACER's have
a special protocol number (32767) to indicate the fact that they are SPACER's but
are otherwise NULL. Only as many protocols can be dynamically loaded as SPACER's
are provided in the protosw[] structure.

The pr_usrreqs structure is treated more special and contains pointers to dummy
functions only returning EOPNOTSUPP. This is needed because the use of those
functions pointers is usually not checked within the kernel because until now it
was assumed to be a valid function pointer. Instead of fixing all potential
callers we just return a proper error code.

Two new functions provide a clean API to register and unregister a protocol. The
register function expects a pointer to a valid and complete struct protosw including
a pointer to struct pru_usrreqs provided by the caller. Upon successful registration
the pr_init() function will be called to finish initialization of the protocol. The
unregister function restores the SPACER in place of the protocol again. It is the
responseability of the caller to ensure proper closing of all sockets and freeing
of memory allocation by the unloading protocol.

sys/protosw.h

o Define generic PROTO_SPACER to be 32767
o Prototypes for all pru_*_notsupp() functions
o Prototypes for pf_proto_[un]register() functions

kern/uipc_domain.c

o Global struct pr_usrreqs nousrreqs containing valid pointers to the
pru_*_notsupp() functions
o New functions pf_proto_[un]register()

kern/uipc_socket2.c

o New functions bodies for all pru_*_notsupp() functions


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 131151 26-Jun-2004 rwatson

Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:

- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().

- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() which adjusting
back-pressure on the sockets.

For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.


# 131006 24-Jun-2004 rwatson

Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.


# 130831 21-Jun-2004 rwatson

Merge next step in socket buffer locking:

- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of
sbflush().

- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls
sbrelease_locked().

- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# 130705 19-Jun-2004 rwatson

Assert socket buffer lock in sb_lock() to protect socket buffer sleep
lock state. Convert tsleep() into msleep() with socket buffer mutex
as argument. Hold socket buffer lock over sbunlock() to protect sleep
lock state.

Assert socket buffer lock in sbwait() to protect the socket buffer
wait state. Convert tsleep() into msleep() with socket buffer mutex
as argument.

Modify sofree(), sosend(), and soreceive() to acquire SOCKBUF_LOCK()
in order to call into these functions with the lock, as well as to
start protecting other socket buffer use in their implementation. Drop
the socket buffer mutexes around calls into the protocol layer, around
potentially blocking operations, for copying to/from user space, and
VM operations relating to zero-copy. Assert the socket buffer mutex
strategically after code sections or at the beginning of loops. In
some cases, modify return code to ensure locks are properly dropped.

Convert the potentially blocking allocation of storage for the remote
address in soreceive() into a non-blocking allocation; we may wish to
move the allocation earlier so that it can block prior to acquisition
of the socket buffer lock.

Drop some spl use.

NOTE: Some races exist in the current structuring of sosend() and
soreceive(). This commit only merges basic socket locking in this
code; follow-up commits will close additional races. As merged,
these changes are not sufficient to run without Giant safely.

Reviewed by: juli, tjr


# 130653 17-Jun-2004 rwatson

Merge additional socket buffer locking from rwatson_netperf:

- Lock down low hanging fruit use of sb_flags with socket buffer
lock.

- Lock down low hanging fruit use of so_state with socket lock.

- Lock down low hanging fruit use of so_options.

- Lock down low-hanging fruit use of sb_lowwat and sb_hiwat with
socket buffer lock.

- Annotate situations in which we unlock the socket lock and then
grab the receive socket buffer lock, which are currently actually
the same lock. Depending on how we want to play our cards, we
may want to coallesce these lock uses to reduce overhead.

- Convert a if()->panic() into a KASSERT relating to so_state in
soaccept().

- Remove a number of splnet()/splx() references.

More complex merging of socket and socket buffer locking to
follow.


# 130513 15-Jun-2004 rwatson

Grab the socket buffer send or receive mutex when performing a
read-modify-write on the sb_state field. This commit catches only
the "easy" ones where it doesn't interact with as yet unmerged
locking.


# 130480 14-Jun-2004 rwatson

The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_CANTRCVMORE (so_state)
SS_CANTSENDMORE (so_state)
SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.


# 130398 13-Jun-2004 rwatson

Socket MAC labels so_label and so_peerlabel are now protected by
SOCK_LOCK(so):

- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.


# 130050 04-Jun-2004 rwatson

Mark sun_noname as const since it's immutable. Update definitions
of functions that potentially accept &sun_noname (sbappendaddr(),
et al) to accept a const sockaddr pointer.


# 129979 02-Jun-2004 rwatson

Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

so_qlen so_incqlen so_qstate
so_comp so_incomp so_list
so_head

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns. In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
always allocate a file descriptor with falloc() so that if we do
find a socket, we don't have to encounter the "Oh, there wasn't
a socket" race that can occur if falloc() sleeps in the current
code, which broke inbound accept() ordering, not to mention
requiring backing out socket state changes in a way that raced
with the protocol level. We may want to add a lockless read of
the queue state if polling of empty queues proves to be important
to optimize.

- In accept1(), soref() the socket while holding the accept lock
so that the socket cannot be free'd in a race with the protocol
layer. Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
insert our new socket once we've committed to inserting it, or
races can occur that cause the incomplete socket queue to
overfill. In the previously implementation, it was sufficient
to simply tested once since calling soabort() didn't release
synchronization permitting another thread to insert a socket as
we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
caller to remove a socket from the incomplete connection queue
before calling soabort(), which prevents soabort() from having
to walk into the accept socket to release the socket from its
queue, and avoids races when releasing the accept mutex to enter
soabort(), permitting soabort() to avoid lock ordering issues
with the caller.

- Generally cluster accept queue related operations together
throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.


# 129916 01-Jun-2004 rwatson

The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket. This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.


# 129906 31-May-2004 bmilekic

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 129410 19-May-2004 ps

syncache broke rev 1.23 which was done to fix the "thundering herd"
problem in Apache. Fix it.

Reviewed by: peter


# 127911 05-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 127299 22-Mar-2004 ps

Remove some netbsd debug code that crept into rev 1.116


# 126425 01-Mar-2004 rwatson

Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by: sam


# 126411 29-Feb-2004 rwatson

Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1. This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.

Submitted by: sam


# 125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 122875 18-Nov-2003 rwatson

Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 122352 09-Nov-2003 tanimura

- Implement selwakeuppri() which allows raising the priority of a
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current


# 121628 28-Oct-2003 sam

speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)


# 121307 21-Oct-2003 silby

Change all SYSCTLS which are readonly and have a related TUNABLE
from CTLFLAG_RD to CTLFLAG_RDTUN so that sysctl(8) can provide
more useful error messages.


# 118045 26-Jul-2003 scottl

Guard against MLEN growing larger than a uint8_t due to MSIZE grwoing to a
value of 512 in LINT. This keeps gcc from complaining.


# 116182 11-Jun-2003 obrien

Use __FBSDID().


# 114293 30-Apr-2003 markm

Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.


# 111227 21-Feb-2003 peter

Missing M_TRYWAIT from so_upcall third argument.


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110268 03-Feb-2003 harti

Make the variable types, the sysctl macros and the sysctl handler for
kern.ipc.{maxsockbuf,sockbuf_waste_factor} to agree that those variables
are of type unsigned long.

PR: sparc64/47389
Approved by: jake (mentor)


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109098 11-Jan-2003 tjr

Don't count mbufs with m_type == MT_HEADER or MT_OOBDATA as control data
in sballoc(), sbcompress(), sbdrop() and sbfree(). Fixes fstat() st_size
reporting and kevent() EVFILT_READ on TCP sockets.


# 106473 05-Nov-2002 kbyanc

Spotted a couple of places where the socket buffer's counters were being
manipulated directly (rather than using sballoc()/sbfree()); update them
to tweak the new sb_ctl field too.

Sponsored by: NTT Multimedia Communications Labs


# 106326 02-Nov-2002 alc

Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing
a panic because the socket's state isn't as expected by sofree().

Discussed with: dillon, fenner


# 103554 18-Sep-2002 phk

Use m_length() instead of home-rolled versions.


# 101996 16-Aug-2002 dg

Further improved the performance of sbreserve() by moving the calculation
of the adjusted sb_max into a sysctl handler for sb_max and assigning it to
a variable that is used instead. This eliminates the 32bit multiply and
divide from the fast path that was being done previously.


# 101966 16-Aug-2002 dg

Rewrote the space check algorithm in sbreserve() so that the extremely
expensive (!) 64bit multiply, divide, and comparison aren't necessary
(this came in originally from rev 1.19 to fix an overflow with large
sb_max or MCLBYTES).
The 64bit math in this function was measured in some kernel profiles as
being as much as 5-8% of the total overhead of the TCP/IP stack and
is eliminated with this commit. There is a harmless rounding error (of
about .4% with the standard values) introduced with this change,
however this is in the conservative direction (downward toward a
slightly smaller maximum socket buffer size).

MFC after: 3 days


# 101173 01-Aug-2002 rwatson

Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by: bde


# 101013 31-Jul-2002 rwatson

Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on sockets.
In particular, invoke entry points during socket allocation and
destruction, as well as creation by a process or during an
accept-scenario (sonewconn). For UNIX domain sockets, also assign
a peer label. As the socket code isn't locked down yet, locking
interactions are not yet clear. Various protocol stack socket
operations (such as peer label assignment for IPv4) will follow.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 100775 27-Jul-2002 dwmalone

If a socket is disconnected for some reason (like a TCP connection
not responding) then drop any data on the outgoing queue in
soisdisconnected because there is no way to get it to its destination
any longer.

The only objection to this patch I got on -net was from Terry, who
wasn't sure that the condition in question could arise, so I provided
some example code.


# 100712 26-Jul-2002 robert

Fix -Werror build for sparc64: Use the appropriate conversion
specifier for an 'unsigned int' argument.


# 98998 29-Jun-2002 alfred

More caddr_t removal.
Change struct knote's kn_hook from caddr_t to void *.


# 98385 18-Jun-2002 tanimura

Remove so*_locked(), which were backed out by mistake.


# 97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 97004 20-May-2002 silby

Subtle fix to the accept filter LRU code. In some cases, a newly
initialized socket with no qlimit was being passed in. In order
to handle this case properly, we must not use >= when comparing
queue sizes to qlimit. As a result of this improper handling,
a panic could result in certain cases.

PR: 38325
MFC after: 3 days


# 96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 96165 07-May-2002 tanimura

Do not forget to increase the number of completely connected sockets in
soisconnected_locked().

Forgotten by: tanimura


# 95883 01-May-2002 alfred

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


# 95759 30-Apr-2002 tanimura

Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to
sys/signalvar.h.

While I am here, sort include files alphabetically, where possible.


# 95553 27-Apr-2002 tanimura

Fix the code fragment clobbered in my last commit.


# 95552 27-Apr-2002 tanimura

Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.


# 95478 26-Apr-2002 silby

Make sure that sockets undergoing accept filtering are aborted in a
LRU fashion when the listen queue fills up. Previously, there was
no mechanism to kick out old sockets, leading to an easy DoS of
daemons using accept filtering.

Reviewed by: alfred
MFC after: 3 days


# 95343 24-Apr-2002 silby

Remove sodropablereq - this function hasn't been used since the
syncache went in.

MFC after: 3 days


# 92754 20-Mar-2002 jeff

Backout part of my previous commit; I was wrong about vm_zone's handling of
limits on zones w/o objects.


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 90227 05-Feb-2002 dillon

Get rid of the twisted MFREE() macro entirely.

Reviewed by: dg, bmilekic
MFC after: 3 days


# 89109 09-Jan-2002 silby

Revert 1.81; 1.19 fixed this already in a different way.


# 88949 06-Jan-2002 silby

Reorder a calculation in sbreserve so that it does not overflow
with multi-megabyte socket buffer sizes.

PR: 7420
MFC after: 3 weeks


# 88633 29-Dec-2001 alfred

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exec(9) and rm_at_exec(9), these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's archetecuterally neutral,
the problem was that one had to pass it a paramater indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primative system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# 88329 21-Dec-2001 peter

Avoid an interaction between syncache and accept filters. The syncache
code only passed up the connection to the tcp stack when it was complete,
so it went directly into the so_comp (complete) queue. However, with
accept filters, there is an additional phase before calling it "complete".

Reviewed by: jlemon


# 87821 13-Dec-2001 rwatson

o Back out portions of 1.50 and 1.47, eliminating sonewconn3() and
always deriving the credential for a newly accepted connection from
the listen socket. Previously, the selection of the credential
depended on the protocol: UNIX domain sockets would use the
connecting process's credential, and protocols supporting a creation
of the socket before the receiving end called accept() would use
the listening socket. After this change, it is always the listening
credential.

Reviewed by: green


# 86487 17-Nov-2001 dillon

Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.


# 84827 11-Oct-2001 jhb

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# 84468 04-Oct-2001 dwmalone

Allow sbcreatecontrol to make cluster sized control messages.


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 78945 29-Jun-2001 jlemon

Fix up indentation.


# 77900 08-Jun-2001 peter

"Fix" the previous initial attempt at fixing TUNABLE_INT(). This time
around, use a common function for looking up and extracting the tunables
from the kernel environment. This saves duplicating the same function
over and over again. This way typically has an overhead of 8 bytes + the
path string, versus about 26 bytes + the path string.


# 77853 07-Jun-2001 peter

Back out part of my previous commit. This was a last minute change
and I botched testing. This is a perfect example of how NOT to do
this sort of thing. :-(


# 77843 06-Jun-2001 peter

Make the TUNABLE_*() macros look and behave more consistantly like the
SYSCTL_*() macros. TUNABLE_INT_DECL() was an odd name because it didn't
actually declare the int, which is what the name suggests it would do.


# 77598 01-Jun-2001 jesper

Revert the last bits of my bogus move of NMBCLUSTERS
to <sys/param.h>


# 77544 31-May-2001 jesper

Move the definition of NMBCLUSTERS from src/sys/kern/uipc_mbuf.c
to <sys/param.h>, so it's available to src/sys/netinet/ip_input.c,
and remove the now unneeded includes of "opt_param.h".

MFC after: 1 week


# 76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 68918 19-Nov-2000 dwmalone

Make sbcompress use the new M_WRITABLE macro. Previously sbcompress
could not compress into clusters. This could result in lots of
wasted clusters while recieving small packets from an interface
that uses clusters for all it's packets.

Patch is partially from BSDi (limiting the size of the copy) and
based on a patch for 4.1 by Ian Dowse <iedowse@maths.tcd.ie> and
myself.

Reviewed by: bmilekic
Obtained From: BSDi
Submitted by: iedowse


# 65495 05-Sep-2000 truckman

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 65278 31-Aug-2000 green

Fix hangs caused by overzealous code removal.

Thanks, Nickolay, for figuring out this is the problem.

Submitted by: Nickolay Dudorov <nnd@mail.nsk.ru>


# 65229 30-Aug-2000 green

Remove an extraneous setting of sb_hiwat.


# 65198 29-Aug-2000 green

Remove any possibility of hiwat-related race conditions by changing
the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and
a u_long target to set it to. The whole thing is splnet().

This fixes a problem that jdp has been able to provoke.


# 64046 31-Jul-2000 ps

Remove unnecessary call to splnet when setting an accept filter
since we are already at splnet.


# 61976 22-Jun-2000 alfred

fix races in the uidinfo subsystem, several problems existed:

1) while allocating a uidinfo struct malloc is called with M_WAITOK,
it's possible that while asleep another process by the same user
could have woken up earlier and inserted an entry into the uid
hash table. Having redundant entries causes inconsistancies that
we can't handle.

fix: do a non-waiting malloc, and if that fails then do a blocking
malloc, after waking up check that no one else has inserted an entry
for us already.

2) Because many checks for sbsize were done as "test then set" in a non
atomic manner it was possible to exceed the limits put up via races.

fix: instead of querying the count then setting, we just attempt to
set the count and leave it up to the function to return success or
failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
and deletions needed to be in their own functions for clarity.

Reviewed by: green


# 61837 20-Jun-2000 alfred

return of the accept filter part II

accept filters are now loadable as well as able to be compiled into
the kernel.

two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)

Reviewed by: jmg


# 61799 18-Jun-2000 alfred

backout accept optimizations.

Requested by: jmg, dcs, jdp, nate


# 61714 15-Jun-2000 alfred

add socketoptions DELAYACCEPT and HTTPACCEPT which will not allow an accept()
until the incoming connection has either data waiting or what looks like a
HTTP request header already in the socketbuffer. This ought to reduce
the context switch time and overhead for processing requests.

The initial idea and code for HTTPACCEPT came from Yahoo engineers and has
been cleaned up and a more lightweight DELAYACCEPT for non-http servers
has been added

Reviewed by: silence on hackers.


# 59288 16-Apr-2000 jlemon

Introduce kqueue() and kevent(), a kernel event notification facility.


# 57719 03-Mar-2000 shin

CMSG_XXX macros alignment fixes to follow RFC2292.

Approved by: jkh

Submitted by: Partly from tech@openbsd
Reviewed by: itojun


# 57441 24-Feb-2000 shin

Add length check to sbcreatecontrol().

Now this check is necessary because IPv6 source routing might use
control data bigger than MLEN. (e.g. 16bytes IPv6 addr x 23 hops)
Actually mbuf cluster should be used in uipc_socket.c:sbcreatecontrol()
and uipc_syscalls.c:sockargs() when data size is bigger then MLEN,
and such patches were already in KAME environment and have been
confirmed to work well. I just forgot to merge them into 4.0, sorry.

For safety, I'll postpone such patches until after 4.0 release.
The effect of postponement is followings.
-Ping6 source routing hops are limitted to around 6 or so.
-If some apps do setsockopt IPV6_RTHDR and try to receive
incoming IPv6 source routing info, it can't receive more
than 6 hops source routing info.
(But currently, no apps seems to be doing it.)

Approved by: jkh


# 55943 14-Jan-2000 jasone

Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>


# 52070 09-Oct-1999 green

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


# 51757 28-Sep-1999 pb

In sbflush(), don't exit the while loop too early: this can cause
an empty mbuf to stay in the queue, then causing a needless panic
because sb_cc == 0 and sb_mbcnt != 0.

But we still need to panic rather than endlessly looping if, for
some reason, sb_cc == 0 and there are non-empty mbufs in the queue.

PR: kern/11988
Reviewed by: fenner


# 51381 19-Sep-1999 green

Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.


# 50477 28-Aug-1999 peter

$Id$ -> $FreeBSD$


# 48579 05-Jul-1999 msmith

Move the initialisation/tuning of nmbclusters from param.c/machdep.c
into uipc_mbuf.c. This reduces three sets of identical tunable code to
one set, and puts the initialisation with the mbuf code proper.

Make NMBUFs tunable as well.

Move the nmbclusters sysctl here as well.

Move the initialisation of maxsockets from param.c to uipc_socket2.c,
next to its corresponding sysctl.

Use the new tunable macros for the kern.vm.kmem.size tunable (this should have
been in a separate commit, whoops).


# 47992 17-Jun-1999 green

Reviewed by: the cast of thousands

This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.


# 46922 10-May-1999 peter

Update one set of comments.. s/so_q0/so_incomp/ and s/so_q/so_comp/ (that's
incomplete and complete connections I think)


# 46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 43196 25-Jan-1999 fenner

Port NetBSD's 19990120-accept bug fix. This works around the race condition
where select(2) can return that a listening socket has a connected socket
queued, the connection is broken, and the user calls accept(2), which then
blocks because there are no connections queued.

Reviewed by: wollman
Obtained from: NetBSD
(ftp://ftp.NetBSD.ORG/pub/NetBSD/misc/security/patches/19990120-accept)


# 41591 07-Dec-1998 archie

The "easy" fixes for compiling the kernel -Wunused: remove unreferenced static
and local variables, goto labels, and functions declared but not defined.


# 41298 23-Nov-1998 truckman

We can't call fsetown() from sonewconn() because sonewconn() is be called
from an interrupt context and fsetown() wants to peek at curproc, call
malloc(..., M_WAITOK), and fiddle with various unprotected data structures.
The fix is to move the code that duplicates the F_SETOWN/FIOSETOWN state
of the original socket to the new socket from sonewconn() to accept1(),
since accept1() runs in the correct context. Deferring this until the
process calls accept() is harmless since the process can't do anything
useful with SIGIO on the new socket until it has the descriptor for that
socket.

One could make the case for not bothering to duplicate the
F_SETOWN/FIOSETOWN state and requiring the process to explicitly make the
fcntl() or ioctl() call on the new socket, but this would be incompatible
with the previous implementation and might break programs which rely on
the old semantics.

This bug was discovered by Andrew Gallatin <gallatin@cs.duke.edu>.


# 41086 11-Nov-1998 truckman

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind


# 40913 04-Nov-1998 fenner

Fix sbcheck() to check all packets on socket buffer.
Also fix data types and printf formats while I'm here.

PR: misc/8494

Panic instead of looping forever in sbflush(). If sb_mbcnt counts
more mbufs than sb_cc counts bytes, the original code can turn into an
infinite loop of removing 0 bytes from the socket buffer until it's empty.


# 38860 05-Sep-1998 bde

Fixed recently perpetrated printf format errors.


# 38808 04-Sep-1998 ache

make sbflush panic messages more descriptive


# 36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 36528 31-May-1998 peter

Have the wakeup routine do the upcall if needed.

Obtained from: NetBSD


# 36119 17-May-1998 phk

s/nanoruntime/nanouptime/g
s/microruntime/microuptime/g

Reviewed by: bde


# 36079 15-May-1998 wollman

Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).


# 35413 24-Apr-1998 dg

Added kern.ipc.nmbclusters


# 35029 04-Apr-1998 phk

Time changes mark 2:

* Figure out UTC relative to boottime. Four new functions provide
time relative to boottime.

* move "runtime" into struct proc. This helps fix the calcru()
problem in SMP.

* kill mono_time.

* add timespec{add|sub|cmp} macros to time.h. (XXX: These may change!)

* nanosleep, select & poll takes long sleeps one day at a time

Reviewed by: bde
Tested by: ache and others


# 33955 01-Mar-1998 guido

Make sure that you can only bind a more specific address when it is
done by the same uid.
Obtained from: OpenBSD


# 29210 07-Sep-1997 bde

Removed trailing semicolons from the definitions of the sysctl
declaration macros so that a semicolon can be added when the macros
are invoked without giving a (pedantic) syntax error. Invocations
need to be followed by a semicolon so that programs like indent and
gtags don't get confused.

Fixed the one invocation that wasn't followed by a trailing semicolon.


# 29111 04-Sep-1997 tegge

sonewconn no longer passes curproc to the protocol attach method
since that might cause in_pcballoc to call MALLOC with M_WAITOK during
a software interrupt.
Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 28270 16-Aug-1997 wollman

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 27531 19-Jul-1997 fenner

Remove sonewconn() macro kludge, introduced in 4.3-Reno to catch argument
mismatches. Prototypes do a much better job these days.

Noticed by: bde


# 26096 24-May-1997 peter

Attempt to convert the ip_divert code to use the new-style protocol request
switch. I needed 'LINT' to compile for other reasons so I kinda got the
blood on my hands. Note: I don't know how to test this, I don't know if
it works correctly.


# 25201 27-Apr-1997 wollman

The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled
in.

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.


# 24442 31-Mar-1997 dg

In accept1(), falloc() is called after the process has awoken, but prior
to removing the connection from the queue. The problem here is that
falloc() may block and this would allow another process to accept the
connection instead. If this happens to leave the queue empty, then the
system will panic with an "accept: nothing queued".

Also changed a wakeup() to a wakeup_one() to avoid the "thundering herd"
problem on new connections in Apache (or any other application that has
multiple processes blocked in accept() for the same socket).


# 23081 24-Feb-1997 wollman

Create a new branch of the kernel MIB, kern.ipc, to store
all of the configurables and instrumentation related to
inter-process communication mechanisms. Some variables,
like mbuf statistics, are instrumented here for the first
time.

For mbuf statistics: also keep track of m_copym() and
m_pullup() failures, and provide for the user's inspection
the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.


# 22936 19-Feb-1997 wollman

Make the operation of sonewconn1() a bit clearer by calling
pru_attach() before putting the new connection on the
connection queue.


# 22899 18-Feb-1997 wollman

uipc_mbuf.c: do a better job of counting how often we have to wait
for memory, or are denied a cluster.

uipc_socket2.c: define some generic ``operation-not-supported'' entry points
for pr_usrreqs.


# 22658 13-Feb-1997 wollman

For large values of sb_max or MCLBYTES, it was possible for the expression
sb_max * MCLBYTES / (MSIZE + MCLBYTES)
used in sbreserve() to overflow, causing all socket creation attempts
to fail. Force the calculation to use u_quad_t's, which makes overflow
less likely.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 19622 11-Nov-1996 fenner

Add the IP_RECVIF socket option, which supplies a packet's incoming interface
using a sockaddr_dl.

Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).


# 18874 11-Oct-1996 pst

Fix two bugs I accidently put into the syn code at the last minute
(yes I had tested the hell out of this).

I've also temporarily disabled the code so that it behaves as it previously
did (tail drop's the syns) pending discussion with fenner about some socket
state flags that I don't fully understand.

Submitted by: fenner


# 18787 07-Oct-1996 pst

Increase robustness of FreeBSD against high-rate connection attempt
denial of service attacks.

Reviewed by: bde,wollman,olah
Inspired by: vjs@sgi.com


# 18365 19-Sep-1996 pst

Add a new sysctl variable kern.sominqueue to override the MINIMUM queue
specified in a listen(2) system call.


# 17675 19-Aug-1996 julian

for kern_conf.c, start allocating dynamic major numbers
half way through the range rather than possibly colliding with
fixed elements. Increase the size of the arrays to take this into account..
remember that each element in the array is now only 1 ponter so this
isn't that much..

also note a possible bug in debugging code in uipc_socket2.c (add XXX)


# 17096 11-Jul-1996 wollman

Modify the kernel to use the new pr_usrreqs interface rather than the old
pr_usrreq mechanism which was poorly designed and error-prone. This
commit renames pr_usrreq to pr_ousrreq so that old code which depended on it
would break in an obvious manner. This commit also implements the new
interface for TCP, although the old function is left as an example
(#ifdef'ed out). This commit ALSO fixes a longstanding bug in the
TCP timer processing (introduced by davidg on 1995/04/12) which caused
timer processing on a TCB to always stop after a single timer had
expired (because it misinterpreted the return value from tcp_usrreq()
to indicate that the TCB had been deleted). Finally, some code
related to polling has been deleted from if.c because it is not
relevant t -current and doesn't look at all like my current code.


# 17047 09-Jul-1996 wollman

This is a proposal-in-code for a substantial modification of the way
the high kernel calls into a protocol stack to perform requests on the
user's behalf. We replace the pr_usrreq() entry in struct protosw with a
pointer to a structure containing pointers to functions which implement
the various reuqests; each function is declared with the correct type and
number of arguments. (This is unlike the current scheme in which a quarter
of the requests take arguments of type other than (struct mbuf *) and the
difference is papered over with casts.) There are a few benefits to this
new scheme:

1) Arguments are passed with their correct types, and null-pointer dummies
are no longer necessary.

2) There should be slightly better caching effects from eliminating
the prximity to extraneous code and th switch in pr_usrreq().

3) It becomes much easier to change the types of the arguments to something
other than `struct mbuf *' (e.g.,pushing the work of sosend() into
the protocol as advocated by Van Jacobson).

There is one principal drawback: existing protocol stacks need to
be modified. This is alleviated by compatibility code in
uipc_socket2.c and uipc_domain.c which emulates the new interface
in terms of the old and vice versa.

This idea is not original to me. I read about what Jacobson did
in one of his papers and have tried to implement the first steps
towards something like that here. Much work remains to be done.


# 16322 12-Jun-1996 gpalmer

Clean up -Wunused warnings.

Reviewed by: bde


# 14547 11-Mar-1996 dg

Changed socket code to use 4.4BSD queue macros. This includes removing
the obsolete soqinsque and soqremque functions as well as collapsing
so_q0len and so_qlen into a single queue length of unaccepted connections.
Now the queue of unaccepted & complete connections is checked directly
for queued sockets. The new code should be functionally equivilent to
the old while being substantially faster - especially in cases where
large numbers of connections are often queued for accept (e.g. http).


# 13267 05-Jan-1996 wollman

Eliminate the dramatic TCP performance decrease observed for writes in
the range [210:260] by sweeping the problem under the rug. This change
has the following effects:

1) A new MIB variable in the kern branch is defined to allow modification
of the socket buffer layer's ``wastage factor'' (which determines how
much unused-but-allocated space in mbufs and mbuf clusters is allowed
in a socket buffer).

2) The default value of the wastage factor is changed from 2 to 8.
The original value was chosen when MINCLSIZE was 7*MLEN (!), and is not
appropriate for an environment where MINCLSIZE is much less.

The real solution to this problem is to scrap both mbufs and sockbufs
and completely redesign the buffering mechanism used at both levels.


# 12843 14-Dec-1995 bde

Nuked ambiguous sleep message strings:
old: new:
netcls[] = "netcls" "soclos"
netcon[] = "netcon" "accept", "connec"
netio[] = "netio" "sblock", "sbwait"


# 12041 03-Nov-1995 wollman

Make somaxconn (maximum backlog in a listen(2) request) and sb_max
(maximum size of a socket buffer) tunable.

Permit callers of listen(2) to specify a negative backlog, which
is translated into somaxconn. Previously, a negative backlog was
silently translated into 0.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 3308 02-Oct-1994 phk

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources