#
272461 |
|
02-Oct-2014 |
gjb |
Copy stable/10@r272459 to releng/10.1 as part of the 10.1-RELEASE process.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation |
#
270094 |
|
17-Aug-2014 |
mjg |
MFC r269020:
Cosmetic changes to unp_internalize
Don't throw away the result of fget_unlocked. Move fdp increment to for loop to make it consistent with similar code elsewhere.
|
#
269490 |
|
03-Aug-2014 |
peter |
Insta-MFC r269489: partial revert of r262867 which was MFC'ed as r263820. Don't ignore sndbuf/rcvbuf limits for SOCK_DGRAM sockets. This appears to be an edit error or patch fuzz mismatch.
|
#
269046 |
|
24-Jul-2014 |
kevlo |
MFC r268787:
Deprecate m_act. Use m_nextpkt always.
|
#
269044 |
|
24-Jul-2014 |
kevlo |
MFC r268601:
Make bind(2) and connect(2) return EAFNOSUPPORT for AF_UNIX on wrong address family.
See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191586 for the original discussion.
Reviewed by: terry
|
#
268341 |
|
06-Jul-2014 |
mjg |
MFC r267947:
Check lower bound of cmsg_len.
If the passed cm->cmsg_len was below the cmsghdr size, the expression: datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data;
would give a negative result. However, in practice it would not result in a crash because the kernel would try to obtain garbage fds for the given process and would error out with EBADF.
PR: 124908 Submitted by: campbell mumble.net (modified a little)
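As an illustration of the lower-bound check described above, the following userland sketch validates cmsg_len before computing the payload length. The helper names cmsg_payload_len and payload_len_for are hypothetical and are not taken from the kernel source; the actual check in unp_internalize may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: return the number of payload bytes that follow the
 * control message header, or -1 if cmsg_len is smaller than the header or
 * larger than the buffer that holds the message.
 */
static long
cmsg_payload_len(const struct cmsghdr *cm, size_t buflen)
{
	if (buflen < CMSG_LEN(0))
		return (-1);
	if (cm->cmsg_len < CMSG_LEN(0) || cm->cmsg_len > buflen)
		return (-1);
	return ((long)(cm->cmsg_len - CMSG_LEN(0)));
}

/* Hypothetical wrapper: build a header that claims claimed_len bytes. */
static long
payload_len_for(size_t claimed_len, size_t buflen)
{
	struct cmsghdr cm;

	memset(&cm, 0, sizeof(cm));
	cm.cmsg_len = claimed_len;
	return (cmsg_payload_len(&cm, buflen));
}
```

Without the lower-bound test, a cmsg_len of 1 would make the subtraction wrap instead of being rejected.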
|
#
264080 |
|
03-Apr-2014 |
asomers |
MFC r263116
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically changed the sending sockbuf's high water mark whenever adding or removing data from the receiving sockbuf. It worked for stream sockets, but it never worked for SOCK_SEQPACKET sockets because of their atomic nature. If the sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit 3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that sockbuf, causing sosend_generic and friends to block. uipc_rcvd will likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no longer need to call chgsbsize() on every send and receive because they don't change the sockbuf's high water mark. Also, uipc_sense no longer needs to acquire the UIPC linkage lock, because it's simpler to compute the st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum size by at most one packet of size less than the max. I don't think that's a serious problem. In fact, I'm not even positive that FreeBSD guarantees a socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
    Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
    Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
    Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP backpressure mechanism. Removing obsolete unpcb fields from db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
    Clear expected failures from ATF.
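The SB_STOP mechanism described in this commit can be modeled in userland roughly as follows. This is a simplified sketch, not the kernel code: struct sockbuf_model, MODEL_SB_STOP and the model_* functions are hypothetical names, the flag value is chosen for illustration, and the comparison against the sender's limit is an assumption about how "surpasses the socket's size limit" is evaluated.

```c
#include <assert.h>

#define MODEL_SB_STOP 0x1000	/* flag value chosen for illustration */

struct sockbuf_model {
	unsigned int sb_hiwat;	/* nominal size limit */
	unsigned int sb_cc;	/* bytes currently queued */
	int sb_flags;
};

/* Space left in the buffer; a stopped buffer reports none. */
static long
model_sbspace(const struct sockbuf_model *sb)
{
	if (sb->sb_flags & MODEL_SB_STOP)
		return (0);
	return ((long)sb->sb_hiwat - (long)sb->sb_cc);
}

/*
 * Sender side: data is queued directly on the peer's receive buffer.
 * Once the queued amount reaches the limit, stop the sending buffer.
 */
static void
model_send(struct sockbuf_model *snd, struct sockbuf_model *peer_rcv,
    unsigned int len)
{
	peer_rcv->sb_cc += len;
	if (peer_rcv->sb_cc >= snd->sb_hiwat)
		snd->sb_flags |= MODEL_SB_STOP;
}

/* Receiver side: after draining data, let the sender run again. */
static void
model_rcvd(struct sockbuf_model *snd, struct sockbuf_model *peer_rcv,
    unsigned int len)
{
	peer_rcv->sb_cc -= (len < peer_rcv->sb_cc) ? len : peer_rcv->sb_cc;
	if (peer_rcv->sb_cc < snd->sb_hiwat)
		snd->sb_flags &= ~MODEL_SB_STOP;
}
```

In this model the sending buffer never holds data itself, so model_sbspace() returns either 0 or the full limit, and the peer's queue can overshoot the limit by one send, which matches the drawback the commit message describes.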
|
#
263820 |
|
27-Mar-2014 |
asomers |
MFC r262867
Fix PR kern/185813 "SOCK_SEQPACKET AF_UNIX sockets with asymmetrical buffers drop packets". It was caused by a check for the space available in a sockbuf, but it was checking the wrong sockbuf.
sys/sys/sockbuf.h
sys/kern/uipc_sockbuf.c
    Add sbappendaddr_nospacecheck_locked(), which is just like sbappendaddr_locked but doesn't validate the receiving socket's space. Factor out common code into sbappendaddr_locked_internal(). We shouldn't simply make sbappendaddr_locked check the space and then call sbappendaddr_nospacecheck_locked, because that would cause the O(n) function m_length to be called twice.
sys/kern/uipc_usrreq.c
    Use sbappendaddr_nospacecheck_locked for SOCK_SEQPACKET sockets, because the receiving sockbuf's size limit is irrelevant.
tests/sys/kern/unix_seqpacket_test.c
    Now that 185813 is fixed, pipe_128k_8k fails intermittently due to 185812. Make it fail every time by adding a usleep after starting the writer thread and before starting the reader thread in test_pipe. That gives the writer time to fill up its send buffer. Also, clear the expected failure message due to 185813. It actually said "185812", but that was a typo.
PR: kern/185813
|
#
256281 |
|
10-Oct-2013 |
gjb |
Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.
Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
|
#
255478 |
|
11-Sep-2013 |
glebius |
Provide the pr_ctloutput method for AF_LOCAL/SOCK_SEQPACKET sockets. This makes setsockopt() work on them.
Reported by: Yuri <yuri rawbw.com> Approved by: re (kib)
|
#
255219 |
|
04-Sep-2013 |
pjd |
Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; };
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements.
The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future.
To define a new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, e.g.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine a few rights, but the rights have to belong to the same array element, e.g.:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
        __cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for alias definitions.
This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
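The index-bit encoding described in this commit can be modeled as a small userland sketch. The MODEL_* macros and model_right_index() are hypothetical names introduced here; the layout (top two bits reserved, the next five bits carrying a one-hot array index) follows the description above, and the right values are the ones quoted in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 63..62 are reserved; bits 61..57 hold a one-hot array index. */
#define MODEL_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))

#define MODEL_CAP_LOOKUP	MODEL_CAPRIGHT(0, 0x0000000000000400ULL)
#define MODEL_CAP_FCHMOD	MODEL_CAPRIGHT(0, 0x0000000000002000ULL)
#define MODEL_CAP_PDKILL	MODEL_CAPRIGHT(1, 0x0000000000000800ULL)

/*
 * Return the array index a right (or an OR of rights) belongs to, or -1
 * if the value mixes rights from different array elements or carries no
 * index bit at all.
 */
static int
model_right_index(uint64_t right)
{
	unsigned int bits = (unsigned int)((right >> 57) & 0x1f);
	int idx = 0;

	/* Exactly one of the five index bits must be set. */
	if (bits == 0 || (bits & (bits - 1)) != 0)
		return (-1);
	while ((bits >>= 1) != 0)
		idx++;
	return (idx);
}
```

Because ORing rights from different elements sets two index bits, a check of this kind is enough to reject a combination such as CAP_LOOKUP | CAP_PDKILL while still accepting same-element aliases.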
|
#
252502 |
|
02-Jul-2013 |
mjg |
Fix receiving fd over unix socket broken in r247740.
If n fds were passed, it would receive the first one n times.
Reported by: Shawn Webb <lattera@gmail.com>, koobs, gleb Tested by: koobs, gleb Reviewed by: pjd
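For readers unfamiliar with the operation being fixed, here is a minimal userland sketch of passing several descriptors in one SCM_RIGHTS message. It uses only standard socket calls; the helper names send_fds, recv_fds and fd_passing_roundtrip are hypothetical. The round trip passes both ends of a pipe and then checks that data written to the second received descriptor can be read from the first.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define NFDS 2

/* Control buffer sized and aligned for one SCM_RIGHTS message of NFDS fds. */
union ctlbuf {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(int) * NFDS)];
};

/* Send NFDS descriptors in a single SCM_RIGHTS message; 0 on success. */
static int
send_fds(int sock, const int fds[NFDS])
{
	union ctlbuf ctl;
	char byte = 'x';
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cm;

	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;
	cm->cmsg_len = CMSG_LEN(sizeof(int) * NFDS);
	memcpy(CMSG_DATA(cm), fds, sizeof(int) * NFDS);
	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Receive exactly NFDS descriptors; 0 on success. */
static int
recv_fds(int sock, int fds[NFDS])
{
	union ctlbuf ctl;
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cm;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	if (recvmsg(sock, &msg, 0) != 1)
		return (-1);
	cm = CMSG_FIRSTHDR(&msg);
	if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
	    cm->cmsg_type != SCM_RIGHTS ||
	    cm->cmsg_len != CMSG_LEN(sizeof(int) * NFDS))
		return (-1);
	memcpy(fds, CMSG_DATA(cm), sizeof(int) * NFDS);
	return (0);
}

/* Pass both ends of a pipe and use the received copies; 0 on success. */
static int
fd_passing_roundtrip(void)
{
	int sv[2], p[2], got[NFDS];
	char c = 0;
	int rc = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return (-1);
	if (pipe(p) == -1) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	if (send_fds(sv[0], p) == 0 && recv_fds(sv[1], got) == 0) {
		/* got[0] should be the read end, got[1] the write end. */
		if (write(got[1], "k", 1) == 1 &&
		    read(got[0], &c, 1) == 1 && c == 'k')
			rc = 0;
		close(got[0]);
		close(got[1]);
	}
	close(p[0]);
	close(p[1]);
	close(sv[0]);
	close(sv[1]);
	return (rc);
}
```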
|
#
251374 |
|
04-Jun-2013 |
glebius |
Improve r250890, so that we stop processing of a message with zero descriptors as early as possible, and assert that number of descriptors is positive in unp_freerights().
Reviewed by: mjg, pjd, jilles
|
#
250890 |
|
21-May-2013 |
mjg |
passing fd over unix socket: fix a corner case where caller wants to pass no descriptors.
Previously the kernel would leak memory and try to free a potentially arbitrary pointer.
Reviewed by: pjd
|
#
250460 |
|
10-May-2013 |
eadler |
Fix a bunch of typos.
PR: misc/174625 Submitted by: Jeremy Chadwick <jdc@koitsu.org>
|
#
249480 |
|
14-Apr-2013 |
mjg |
Add fdallocn function and use it when passing fds over unix socket.
This gets rid of "unp_externalize fdalloc failed" panic.
Reviewed by: pjd MFC after: 1 week
|
#
248534 |
|
19-Mar-2013 |
jilles |
Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.
This change allows creating file descriptors with close-on-exec set in some situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS) atomically close-on-exec.
The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_*.
The SOCK_* flags are not passed to MAC because this may cause incorrect failures and can be done later via fcntl() anyway. On the other hand, audit is expected to cope with the new flags.
For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags argument.
Reviewed by: kib
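A short usage sketch of the new socket type flags: both flags are ORed into the type argument of socketpair(), and the resulting descriptors are then inspected with fcntl(). The function name socketpair_flags_ok is hypothetical; MSG_CMSG_CLOEXEC is not exercised here.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a socket pair with close-on-exec and non-blocking mode requested
 * at creation time. Returns 1 if both descriptors carry both properties,
 * 0 if not, and -1 if the pair could not be created.
 */
static int
socketpair_flags_ok(void)
{
	int sv[2];
	int ok = 1;

	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK,
	    0, sv) == -1)
		return (-1);
	for (int i = 0; i < 2; i++) {
		int fdflags = fcntl(sv[i], F_GETFD);
		int flflags = fcntl(sv[i], F_GETFL);

		if (fdflags == -1 || flflags == -1 ||
		    (fdflags & FD_CLOEXEC) == 0 ||
		    (flflags & O_NONBLOCK) == 0)
			ok = 0;
		close(sv[i]);
	}
	return (ok);
}
```

Setting the flags at creation avoids the window between socket() and a later fcntl() call in which another thread could fork and exec.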
|
#
248176 |
|
11-Mar-2013 |
pjd |
Fix a memory leak when one process sends a descriptor over a UNIX domain socket, but the other process exits before receiving it.
|
#
247740 |
|
03-Mar-2013 |
pjd |
For some reason, when I started to pass filedescent structures instead of pointers to the file structure, receiving descriptors stopped working when at least a few kilobytes of data are also being sent. In the kernel, the soreceive_generic() function doesn't see the control mbuf as the first mbuf and unp_externalize() is never called; the first 6(?) kilobytes of data are missing as well on the receiving end.
This breaks for example tmux.
I don't know yet why going from 8 bytes to sizeof(struct filedescent) per descriptor (or even to 16 bytes per descriptor) breaks things, but to work around it for now, use 8 bytes per file descriptor at the cost of memory allocation.
Reported by: flo, Diane Bruce, Jan Beich <jbeich@tormail.org> Simple testcase provided by: mjg
|
#
247736 |
|
03-Mar-2013 |
pjd |
Plug memory leaks in file descriptors passing.
|
#
247667 |
|
02-Mar-2013 |
pjd |
- Implement two new system calls:
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);
which allow binding and connecting, respectively, to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'.
- Add manual pages for the new syscalls.
- Make the new syscalls available for processes in capability mode sandbox.
- Add capability rights CAP_BINDAT and CAP_CONNECTAT that have to be present on the directory descriptor for the syscalls to work.
- Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor.
- Update procstat(1) to recognize the new capability rights.
- Document the new capability rights in cap_rights_limit(2).
Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des
|
#
247602 |
|
01-Mar-2013 |
pjd |
Merge Capsicum overhaul:
- A capability is no longer a separate descriptor type. Now every descriptor has its own set of capability rights.
- The cap_new(2) system call is left, but it is no longer documented and should not be used in new code.
- The new syscall cap_rights_limit(2), which limits capability rights of the given descriptor without creating a new one, should be used instead of cap_new(2).
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If the CAP_IOCTL capability right is present we can further reduce the list of allowed ioctls with the new cap_ioctls_limit(2) syscall. The list of allowed ioctls can be retrieved with the cap_ioctls_get(2) syscall.
- If the CAP_FCNTL capability right is present we can further reduce the fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrieve them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was heavily modified.
- The audit subsystem, kdump and procstat tools were updated to recognize new syscalls.
- Capability rights were revised and even though I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNOD to CAP_MKNODAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convenient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
        (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
         CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
        (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
         CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
         CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib
|
#
246826 |
|
15-Feb-2013 |
pluknet |
Add support for passing the SCM_BINTIME ancillary data object for PF_LOCAL sockets.
PR: kern/175883 Submitted by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Discussed with: glebius, phk MFC after: 2 weeks
|
#
243999 |
|
07-Dec-2012 |
pjd |
Configure UMA warnings for the following zones:
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached
Note that those warnings are printed no more often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.
Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks
|
#
243342 |
|
20-Nov-2012 |
kib |
Schedule the garbage collection run for the in-flight rights passed over unix domain sockets to the next tick, coalescing the serial calls until the collection fires. The thought is that more work for the collector could arise in the near future, allowing it to clean more and not spend too much CPU on repeated collection when there is no garbage.
Currently the collection task is fired immediately upon unix domain socket close if there are any rights in flight, which caused excessive CPU usage and too long blocking of the threads waiting for unp_list_lock and unp_link_rwlock in write mode.
Robert noted that it would be nice if we could find some heuristic by which we decide whether to run GC a bit more quickly. E.g., if the number of UNIX domain sockets is close to its resource limit, but not quite.
Reported and tested by: Markus Gebert <markus.gebert@hostpoint.ch> Reviewed by: rwatson MFC after: 2 weeks
|
#
243152 |
|
16-Nov-2012 |
glebius |
Update comment.
|
#
241896 |
|
22-Oct-2012 |
kib |
Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes.
Conducted and reviewed by: attilio Tested by: pho
|
#
241011 |
|
27-Sep-2012 |
mdf |
Fix up kernel sources to be ready for a 64-bit ino_t.
Original code by: Gleb Kurtsou
|
#
240214 |
|
07-Sep-2012 |
glebius |
Supply the pr_ctloutput method for local datagram sockets, so that setsockopt() and getsockopt() work on them.
This makes 'tools/regression/sockets/unix_cmsg -t dgram' more successful.
|
#
237036 |
|
13-Jun-2012 |
pjd |
When checking if a file descriptor number is valid, explicitly check for 'fd' being less than 0 instead of using the cast-to-unsigned hack.
Today's commit was brought to you by the letters 'B', 'D' and 'E' :)
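The two styles of range check mentioned in this commit can be compared side by side. This is an illustrative sketch only; the function names and the nfiles parameter are hypothetical and do not come from the kernel source.

```c
#include <assert.h>

/* Explicit form: both bounds are spelled out. */
static int
fd_in_range(int fd, int nfiles)
{
	return (fd >= 0 && fd < nfiles);
}

/*
 * Cast-to-unsigned form: a negative fd converts to a very large unsigned
 * value, so a single comparison rejects it (assuming nfiles >= 0).
 */
static int
fd_in_range_cast(int fd, int nfiles)
{
	return ((unsigned int)fd < (unsigned int)nfiles);
}
```

Both forms agree for any non-negative nfiles; the explicit form simply states the intent without relying on integer conversion rules.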
|
#
232317 |
|
29-Feb-2012 |
trociny |
Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH() operations for setting and accessing vnode's v_socket field.
The operations are necessary to implement proper unix socket handling on layered file systems like nullfs(5).
This change fixes the long-standing issue with nullfs(5) in that unix sockets did not work between lower and upper layers: if we bound to a socket on the lower layer we could connect only to the lower path; if we bound to the upper layer we could connect only to the upper path. The new behavior is that one can connect to both the lower and the upper paths regardless of which layer path one binds to.
PR: kern/51583, kern/159663 Suggested by: kib Reviewed by: arch MFC after: 2 weeks
|
#
232152 |
|
25-Feb-2012 |
trociny |
When detaching a unix domain socket, uipc_detach() checks the unp->unp_vnode pointer to detect if there is a vnode associated with (bound to) this socket and does the necessary cleanup if there is.
The issue is that after forced unmount this check may be too late as the unp_vnode is reclaimed and the reference is stale.
To fix this provide a helper function that is called on a socket vnode reclamation to do necessary cleanup.
Pointed by: kib Reviewed by: kib MFC after: 2 weeks
|
#
231976 |
|
21-Feb-2012 |
trociny |
unp_connect() may use a shared lock on the vnode to fetch the socket.
Suggested by: jhb Reviewed by: jhb, kib, rwatson MFC after: 2 weeks
|
#
227309 |
|
07-Nov-2011 |
ed |
Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
|
#
225827 |
|
28-Sep-2011 |
bz |
Fix handling of corrupt compress(1)ed data. [11:04]
Add missing length checks on unix socket addresses. [11:05]
Approved by: so (cperciva) Approved by: re (kensmith) Security: FreeBSD-SA-11:04.compress Security: CVE-2011-2895 [11:04] Security: FreeBSD-SA-11:05.unix
|
#
225040 |
|
20-Aug-2011 |
kib |
Prevent the hiwatermark for the unix domain socket from becoming effectively negative. Often seen as upstream fastcgi connection timeouts in nginx when using sendfile over unix domain sockets for communication.
Sendfile(2) may send more bytes than currently allowed by the hiwatermark of the socket, e.g. because the so_snd sockbuf lock is dropped after the sbspace() call in the kern_sendfile() loop. In this case, the recalculated hiwatermark will overflow. Since lowatermark is renewed as half of the hiwatermark by sendfile code, and both are unsigned, the send buffer never reaches the free space requested by lowatermark, causing an indefinite wait in sendfile.
Reviewed by: rwatson Approved by: re (bz) MFC after: 2 weeks
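The arithmetic problem described above can be shown with a small sketch. This is not the kernel patch: hiwat_unchecked and hiwat_clamped are hypothetical names, and clamping at zero is one plausible way to keep the value from becoming "effectively negative".

```c
#include <assert.h>
#include <limits.h>

/* Unchecked: unsigned subtraction wraps around when queued > hiwat. */
static unsigned int
hiwat_unchecked(unsigned int hiwat, unsigned int queued)
{
	return (hiwat - queued);
}

/* Guarded: never let the recalculated value wrap below zero. */
static unsigned int
hiwat_clamped(unsigned int hiwat, unsigned int queued)
{
	return (hiwat >= queued ? hiwat - queued : 0);
}
```

With a limit of 8192 and 8193 bytes queued, the unchecked form yields UINT_MAX, so a low water mark derived as half of it can never be satisfied; the guarded form yields 0.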
|
#
218757 |
|
16-Feb-2011 |
bz |
Mfp4 CH=177274,177280,177284-177285,177297,177324-177325
VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147.
While this reduces the number of vnet recursions, in some places like NFS, POSIX local sockets and some netgraph, recursions are impossible to fix.
The current expectations are documented at the beginning of uipc_socket.c along with the other information there.
Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec
Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks
|
#
218168 |
|
01-Feb-2011 |
kib |
The unp_gc() function drops and reacquires the lock between the scan and collect phases. The unp_discard() function executes unp_externalize_fp(), which might make the socket eligible for gc-ing, and then, later, taskqueue will close the socket. Since unp_gc() dropped the list lock to do the malloc, close might happen after the mark step but before the collection step, causing collection to not find the socket and miss one array element.
I believe that the race was there before r216158, but the stated revision made the window much wider by postponing the close to taskqueue sometimes.
Only process as many array elements as we find sockets during the second phase of gc [1]. Take the linkage lock and recheck the eligibility of the socket for gc, as well as call fhold() under the linkage lock.
Reported and tested by: jmallett Submitted by: jmallett [1] Reviewed by: rwatson, jeff (possibly) MFC after: 1 week
|
#
217555 |
|
18-Jan-2011 |
mdf |
Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need to rely on the format string.
|
#
216158 |
|
03-Dec-2010 |
kib |
Trim whitespaces at the end of lines. Use the commit to record proper log message for r216150.
MFC after: 1 week
If unix socket has a unix socket attached as the rights that has a unix socket attached as the rights that has a unix socket attached as the rights ... Kernel may overflow the stack on attempt to close such socket.
Only close the rights file in the context of the current close if the file is not unix domain socket. Otherwise, postpone the work to taskqueue, preventing unlimited recursion.
Passing unix domain sockets in SCM_RIGHTS control messages is not widely used, and moreover, closing a socket with rights still attached is mostly an application failure. The change should not affect the performance of typical users of SCM_RIGHTS.
Reviewed by: jeff, rwatson
|
#
216150 |
|
03-Dec-2010 |
kib |
Reviewed by: jeff, rwatson MFC after: 1 week
|
#
210365 |
|
22-Jul-2010 |
trasz |
Remove spurious '/*-' marks and fix some other style problems.
Submitted by: bde@
|
#
210226 |
|
18-Jul-2010 |
trasz |
Revert r210225 - turns out I was wrong; the "/*-" is not license-only thing; it's also used to indicate that the comment should not be automatically rewrapped.
Explained by: cperciva@
|
#
210225 |
|
18-Jul-2010 |
trasz |
The "/*-" comment marker is supposed to denote copyrights. Remove non-copyright occurrences from sys/sys/ and sys/kern/.
|
#
197794 |
|
05-Oct-2009 |
rwatson |
Fix build on amd64, where sysctl arg1 is a pointer.
Reported by: Mr Tinderbox MFC after: 3 months
|
#
197775 |
|
05-Oct-2009 |
rwatson |
First cut at implementing SOCK_SEQPACKET support for UNIX (local) domain sockets. This allows for reliable bi-directional datagram communication over UNIX domain sockets, in contrast to SOCK_DGRAM (M:N, unreliable) or SOCK_STREAM (bi-directional bytestream). Largely, this reuses existing UNIX domain socket code. This allows applications requiring record-oriented semantics to get them reliably via local IPC.
Some implementation notes (also present in XXX comments):
- Currently we lack an sbappend variant able to do datagrams and control data without doing addresses, so we mark SOCK_SEQPACKET as PR_ADDR. Adding a new variant will solve this problem.
- UNIX domain sockets on FreeBSD provide back-pressure/flow control notification for stream sockets by manipulating the send socket buffer's size during pru_send and pru_rcvd. This trick works less well for SOCK_SEQPACKET as sosend_generic() uses sb_hiwat not just to manage blocking, but also to determine maximum datagram size. Fixing this requires rethinking how back-pressure is done for SOCK_SEQPACKET; in the meantime, it's possible to get EMSGSIZE when buffers fill, instead of blocking.
Discussed with: benl Reviewed by: bz, rpaulo MFC after: 3 months Sponsored by: Google
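The record-oriented semantics described in this commit can be observed from userland with a short sketch. The function name seqpacket_record_sizes is hypothetical; the calls are standard socket API. Two records of different sizes are sent back to back, and each recv() with a larger buffer returns exactly one record.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send a 2-byte and a 3-byte record over a SOCK_SEQPACKET pair and store
 * the size returned by each of two recv() calls. Returns 0 on success,
 * -1 if the pair could not be created or a send failed.
 */
static int
seqpacket_record_sizes(ssize_t sizes[2])
{
	int sv[2];
	char buf[64];
	int rc = -1;

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) == -1)
		return (-1);
	if (send(sv[0], "ab", 2, 0) == 2 && send(sv[0], "cde", 3, 0) == 3) {
		/* Each recv() returns one whole record, never a merge. */
		sizes[0] = recv(sv[1], buf, sizeof(buf), 0);
		sizes[1] = recv(sv[1], buf, sizeof(buf), 0);
		rc = 0;
	}
	close(sv[0]);
	close(sv[1]);
	return (rc);
}
```

With SOCK_STREAM the same two sends could be returned as a single 5-byte read, which is the difference the commit message refers to.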
|
#
196019 |
|
01-Aug-2009 |
rwatson |
Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes.
Reviewed by: bz Approved by: re (vimage blanket)
|
#
194707 |
|
23-Jun-2009 |
jamie |
Remove unnecessary/redundant includes.
Approved by: bz (mentor)
|
#
194460 |
|
18-Jun-2009 |
jhb |
Fix a deadlock in the getpeername() method for UNIX domain sockets. Instead of locking the local unp followed by the remote unp, use the same locking model as accept() and read lock the global link lock followed by the remote unp while fetching the remote sockaddr.
Reported by: Mel Flynn mel.flynn of mailing.thruhere.net Reviewed by: rwatson MFC after: 1 week
|
#
193511 |
|
05-Jun-2009 |
rwatson |
Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
|
#
193332 |
|
02-Jun-2009 |
rwatson |
Add internal 'mac_policy_count' counter to the MAC Framework, which is a count of the number of registered policies.
Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero.
This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies.
Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls.
Obtained from: TrustedBSD Project
|
#
191816 |
|
05-May-2009 |
zec |
Change the curvnet variable from a global const struct vnet *, previously always pointing to the default vnet context, to a dynamically changing thread-local one. The curvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discouraged.
This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, socket's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, with an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typical cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all vnet instances.
Approved by: julian (mentor)
|
#
190888 |
|
10-Apr-2009 |
rwatson |
Remove VOP_LEASE and supporting functions. This hasn't been used since the removal of NQNFS, but was left in in case it was required for NFSv4. Since our new NFSv4 client and server can't use it for their requirements, GC the old mechanism, as well as other unused lease-related code and interfaces.
Due to its impact on kernel programming and binary interfaces, this change should not be MFC'd.
Proposed by: jeff Reviewed by: jeff Discussed with: rmacklem, zach loafman @ isilon
|
#
189544 |
|
08-Mar-2009 |
rwatson |
Decompose the global UNIX domain sockets rwlock into two different locks: a global list/counter/generation counter protected by a new mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock, which protects the connections between UNIX domain sockets.
This eliminates conditional lock acquisition that was previously a property of the global lock being held over sonewconn() leading to a call to uipc_attach(), which also required the global lock, but couldn't rely on it as other paths existed to uipc_attach() that didn't hold it: now uipc_attach() uses only the list lock, which follows the linkage lock in the lock order. It may also reduce contention on the global lock for some workloads.
Add global UNIX domain socket locks to hard-coded witness lock order.
MFC after: 1 week Discussed with: kris
|
#
186684 |
|
01-Jan-2009 |
rwatson |
White space and comment tweaks.
MFC after: 3 weeks
|
#
186603 |
|
30-Dec-2008 |
rwatson |
Rename mbcnt to mbcnt_delta in uipc_send() -- unlike other local variables named mbcnt in uipc_usrreq.c, this instance is a delta rather than a cache of sb_mbcnt.
MFC after: 3 weeks
|
#
184205 |
|
23-Oct-2008 |
des |
Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after: 3 months
|
#
183764 |
|
11-Oct-2008 |
rwatson |
Remove stale comment: while uipc_connect2() was, until recently, not static so it could be used by fifofs (actually portalfs), it is now static.
Submitted by: kensmith
|
#
183690 |
|
08-Oct-2008 |
rwatson |
Remove stale comment (and XXX saying so) about why we zero the file descriptor pointer in unp_freerights: we can no longer recurse into unp_gc due to unp_gc being invoked in a deferred way, but it's still a good idea.
MFC after: 3 days
|
#
183689 |
|
08-Oct-2008 |
rwatson |
Differentiate pr_usrreqs for stream and datagram UNIX domain sockets, and employ soreceive_dgram for the datagram case.
MFC after: 3 months
|
#
183650 |
|
06-Oct-2008 |
rwatson |
Now that portalfs doesn't directly invoke uipc_connect2(), make it a static symbol.
MFC after: 3 days
|
#
183572 |
|
03-Oct-2008 |
rwatson |
Further minor cleanups to UNIX domain sockets:
- Staticize and locally prototype functions uipc_ctloutput(), unp_dispose(), unp_init(), and unp_externalize(), none of which have been required outside of uipc_usrreq.c since uipc_proto.c was removed.
- Remove stale prototype for uipc_usrreq(), which has not existed in the code since 1997.
- Forward declare and staticize uipc_usrreqs structure in uipc_usrreq.c and not un.h.
- Comment on why uipc_connect2() is still non-static -- it is used directly by fifofs.
- Remove stale comments, tidy up whitespace.
MFC after: 3 days (where applicable)
|
#
183563 |
|
03-Oct-2008 |
rwatson |
Remove or update several stale comments.
A bit of whitespace/style cleanup.
Update copyright.
MFC after: 3 days (applicable changes)
|
#
180820 |
|
25-Jul-2008 |
trhodes |
Fill in a few sysctl descriptions.
Approved by: rwatson
|
#
180238 |
|
03-Jul-2008 |
emaste |
Use bcopy instead of strlcpy in uipc_bind and unp_connect, since soun->sun_path isn't a null-terminated string. As UNIX(4) states, "the terminating NUL is not part of the address." Since strlcpy has to return "the total length of the string [it] tried to create," it walks off the end of soun->sun_path looking for a \0.
This reverts r105332.
Reported by: Ryan Stone
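The length-based copy described above can be sketched in portable C. This is a minimal illustration, not the kernel code: `copy_counted_path()` is a hypothetical helper, and `memcpy()` stands in for the kernel's `bcopy()`. The point is that the copy is bounded by an explicit length, so nothing ever scans the source for a NUL that may not be there.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the FreeBSD source): copy a path that is
 * described by an explicit length rather than by a terminating NUL.
 * Returns the number of bytes copied, excluding the terminator we add.
 */
static size_t
copy_counted_path(char *dst, size_t dstsize, const char *src, size_t srclen)
{
	size_t n;

	assert(dstsize > 0);
	n = srclen < dstsize - 1 ? srclen : dstsize - 1;
	memcpy(dst, src, n);	/* never reads past src[srclen - 1] */
	dst[n] = '\0';		/* terminate the local copy ourselves */
	return (n);
}
```

A strlcpy()-style routine would instead keep reading `src` until it found a NUL in order to compute its return value, which is the overrun the commit describes.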
|
#
175453 |
|
18-Jan-2008 |
rwatson |
Move unlock of global UNIX domain socket lock slightly lower in unp_connect(): it is expected to return with the lock held, and two possible error paths otherwise returned with it unlocked.
The fix committed here is slightly different from the patch in the PR, but along an alternative line suggested in the PR.
PR: 119778 MFC after: 3 days Submitted by: James Juran <james dot juran at baesystems dot com>
|
#
175294 |
|
13-Jan-2008 |
attilio |
VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the unnecessary extra argument and pass explicitly curthread to lower layer functions, when necessary.
KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
|
#
175212 |
|
10-Jan-2008 |
rwatson |
Remove "lock pushdown" todo item in comment -- I did that for 7.0.
MFC after: 3 weeks
|
#
175211 |
|
10-Jan-2008 |
rwatson |
Correct typos in comments.
MFC after: 3 weeks
|
#
175026 |
|
31-Dec-2007 |
jeff |
- Place the fhold() in unp_internalize_fp to be more consistent with refs.
- Clear all of the gc flags before doing a run. Stale flags were causing us to skip some descriptors.
- If a unp socket has been marked REF in a gc pass it can't be dead.
Found by: rwatson's test tool.
|
#
175009 |
|
31-Dec-2007 |
jeff |
- Check the correct variable against NULL in two places.
- If the unp_file is NULL that means it has never been internalized and it must be reachable.
|
#
174988 |
|
29-Dec-2007 |
jeff |
Remove explicit locking of struct file.
- Introduce a finit() which is used to initialize the fields of struct file in such a way that the ops vector is only valid after the data, type, and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.
Tested by: kris, pho
|
#
172930 |
|
24-Oct-2007 |
rwatson |
Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms:
mac_<object>_<method/action>
mac_<object>_check_<method/action>
The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names.
All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI.
Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
|
#
171599 |
|
26-Jul-2007 |
pjd |
When we do open, we should lock the vnode exclusively. This fixes a few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...
Discussed with: kib, ups Approved by: re (rwatson)
|
#
170096 |
|
29-May-2007 |
rwatson |
Add DDB "show unpcb" command, allowing DDB to print out many pertinent details from UNIX domain socket protocol layer state.
|
#
169471 |
|
11-May-2007 |
rwatson |
Remove one more stale comment regarding unpcb type-safety.
|
#
169470 |
|
11-May-2007 |
rwatson |
Clarify and update quite a few comments to reflect locking optimizations, the addition of unpcb refcounts, and bug fixes. Some of these fixes are appropriate for MFC.
MFC after: 3 days
|
#
169307 |
|
06-May-2007 |
wkoszek |
Don't acquire Giant unconditionally.
Reviewed by: rwatson
|
#
168355 |
|
04-Apr-2007 |
rwatson |
Replace custom file descriptor array sleep lock constructed using a mutex and flags with an sxlock. This leads to a significant and measurable performance improvement as a result of access to shared locking for frequent lookup operations, reduced general overhead, and reduced overhead in the event of contention. All of these are important for threaded applications where simultaneous access to a shared file descriptor array occurs frequently. Kris has reported 2x-4x transaction rate improvements on 8-core MySQL benchmarks; smaller improvements can be expected for many workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular acquisition of the filedesc lock; the plan is that they will now all be fast. Change all locking instances to either shared or exclusive locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep() was called without the mutex held; sx_sleep() is now always called with the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file, rather than the filedesc lock or no lock. Always update the f_ops field last. A further memory barrier is required here in the future (discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to properly acquire vnode references before using vnode pointers. Annotate improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in some cases a shared lock may be sufficient, which should be revisited. The dropping of the filedesc lock in fdgrowtable() is no longer required as the sxlock can be held over the sleep operation; we should consider removing that (pointed out by attilio).
Tested by: kris Discussed with: jhb, kris, attilio, jeff
|
#
167487 |
|
12-Mar-2007 |
rwatson |
In uipc_close(), we no longer always free the unpcb, as the last reference may be dropped later. In this case, always unlock the unpcb so as not to leak the lock.
Found by: kris (BugMagnet)
|
#
167133 |
|
01-Mar-2007 |
rwatson |
Remove two simultaneous acquisitions of multiple unpcb locks from uipc_send in cases where only a global read lock is held by breaking them out and avoiding the unpcb lock acquire in the common case. This avoids deadlocks which manifested with X11, and should also marginally further improve performance.
Reported by: sepotvin, brooks
|
#
167097 |
|
28-Feb-2007 |
rwatson |
Lock unp2 after checking for a non-NULL unp2 pointer in uipc_send() on datagram UNIX domain sockets, not before.
|
#
167030 |
|
26-Feb-2007 |
rwatson |
Revise locking strategy used for UNIX domain sockets in order to improve concurrency:
- Add per-unpcb mutexes protecting unpcb connection state, fields, etc.
- Replace global UNP mutex with a global UNP rwlock, which will protect the UNIX domain socket connection topology, v_socket, and be acquired exclusively before acquiring more than one per-unpcb lock at a time in order to avoid lock order issues.
In performance measurements involving MySQL, this change has little or no overhead on UP (+/- 1%), but leads to a significant (5%-30%) improvement in multi-processor measurements using the sysbench and supersmack benchmarks.
Much testing by: kris Approved by: re (kensmith)
|
#
166883 |
|
22-Feb-2007 |
rwatson |
Add an additional MAC check to the UNIX domain socket connect path: check that the subject has read/write access to the vnode using the vnode MAC check.
MFC after: 3 weeks Submitted by: Spencer Minear <spencer_minear at securecomputing dot com> Obtained from: TrustedBSD Project
|
#
166844 |
|
20-Feb-2007 |
rwatson |
Break introductory comment into two paragraphs to separate material on the garbage collection complications from general discussion of UNIX domain sockets.
Staticize unp_addsockcred().
Remove XXX comment regarding Giant and v_socket -- v_socket is protected by the global UNIX domain socket lock.
|
#
166712 |
|
14-Feb-2007 |
rwatson |
Minor rearrangement of global variables, comments, etc, in UNIX domain sockets.
|
#
166708 |
|
14-Feb-2007 |
rwatson |
Change unp_mtx to support recursion, and do not drop the unp_mtx over sonewconn() in unp_connect(). This avoids a race that occurs due to v_socket being an uncounted reference, as the lock was being released in order to call sonewconn(), which otherwise recurses into the UNIX domain socket code via pru_attach, as well as holding the lock over a sleeping memory allocation in uipc_attach(). Switch to a non-sleeping memory allocation during UNIX domain socket attach.
This fix is non-ideal in that it requires enabling recursion, but is a much smaller change than moving to using true references for v_socket. The reported panic occurs in unp_connect() following the return of sonewconn().
Update copyright year.
Panic reported by: jhb
|
#
166691 |
|
13-Feb-2007 |
rwatson |
Set UNP_CONNECTING when committing to moving ahead in unp_connect(). This logic was lost when merging the remainder of these changes in 1.178.
|
#
166534 |
|
06-Feb-2007 |
rwatson |
Push UNIX domain socket locking further into uipc_ctloutput() in order to avoid holding the UNIX domain socket subsystem lock over sooptcopyin() and sooptcopyout(). This problem was introduced when LOCAL_CREDS and LOCAL_CONNWAIT support were added.
Reviewed by: mdodd
|
#
165889 |
|
08-Jan-2007 |
rwatson |
Canonicalize copyrights in some files I hold copyrights on:
- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.
MFC after: 3 days
|
#
165810 |
|
05-Jan-2007 |
jhb |
- Close a race between enumerating UNIX domain socket pcb structures via sysctl and socket teardown by adding a reference count to the UNIX domain pcb object and fixing the sysctl that enumerates unpcbs to grab a reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and file descriptor teardown (fdrop()) by adding a new garbage collection flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers in a UNIX domain socket looking for nested file descriptor references and clears the flag when it is finished. fdrop() checks to see if the flag is set on a file descriptor whose refcount just dropped to 0 and waits for unp_gc() to clear the flag before completely destroying the file descriptor.
MFC after: 1 week Reviewed by: rwatson Submitted by: ups Hopefully makes the panics go away: mx1
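The hold-while-enumerating pattern from the first item can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: `struct xpcb`, `xpcb_hold()`, `xpcb_rele()`, and `xpcb_snapshot()` are hypothetical names, and the locking a real kernel would need around the reference count is only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a protocol control block; not the real unpcb. */
struct xpcb {
	unsigned	refcount;	/* a real kernel protects this with a lock */
	int		id;
};

/* Take an extra reference so the object outlives the enumeration. */
static void
xpcb_hold(struct xpcb *p)
{
	p->refcount++;
}

/* Drop a reference; returns 1 when the caller released the last one. */
static int
xpcb_rele(struct xpcb *p)
{
	assert(p->refcount > 0);
	return (--p->refcount == 0);
}

/*
 * Phase 1 (list lock held in a real kernel): reference every entry.
 * Phase 2 (lock dropped): the caller may copy entries out and sleep.
 * Phase 3: the caller releases each reference with xpcb_rele().
 */
static size_t
xpcb_snapshot(struct xpcb **list, size_t n, struct xpcb **out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		xpcb_hold(list[i]);
		out[i] = list[i];
	}
	return (n);
}
```

With the extra reference in place, a concurrent teardown only drops the count; whoever releases the last reference is responsible for freeing the object.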
|
#
163606 |
|
22-Oct-2006 |
rwatson |
Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project Sponsored by: SPARTA
|
#
161264 |
|
13-Aug-2006 |
rwatson |
Minor white space tweaks.
|
#
161040 |
|
07-Aug-2006 |
rwatson |
Move definition of UNIX domain socket protosw and domain entries from uipc_proto.c to uipc_usrreq.c, making localdomain static. Remove uipc_proto.c as it's no longer used. With this change, UNIX domain sockets are entirely encapsulated in uipc_usrreq.c.
|
#
161019 |
|
06-Aug-2006 |
rwatson |
Don't set pru_sosend, pru_soreceive, pru_sopoll to default values, as they are already set to default values.
|
#
160919 |
|
02-Aug-2006 |
rwatson |
Remove now unneeded ENOTCONN clause from SOCK_DGRAM side of uipc_send(): we have to check it regardless of the target address, so don't check it twice.
|
#
160868 |
|
31-Jul-2006 |
rwatson |
Close a race that occurs when using sendto() to connect and send on a UNIX domain socket at the same time as the remote host is closing the new connections as quickly as they open. Since the connect() and send() paths are non-atomic with respect to one another, it is possible for the second thread's close() call to disconnect the two sockets as connect() returns, leading to the consumer (which plans to send()) with a NULL kernel pointer to its proposed peer. As a result, after acquiring the UNIX domain socket subsystem lock, we need to revalidate the connection pointers even though connect() has technically succeeded, and return an error to say that there's no connection on which to perform the send.
We might want to rethink the specific errno number, perhaps ECONNRESET would be better.
PR: 100940 Reported by: Young Hyun <youngh at caida dot org> MFC after: 2 weeks MFC note: Some adaptation will be required
|
#
160720 |
|
26-Jul-2006 |
rwatson |
Remove call to soisdisconnected() in uipc_detach(), since it will already have been invoked by uipc_close() or uipc_abort(), and the socket is in a state of being torn down by the time we get to this point, so kqueue state frobbed by soisdisconnected() is not available, so frobbing it will result in a panic.
Reported by: Munehiro Matsuda <haro at h4 dot dion dot ne dot jp>
|
#
160619 |
|
24-Jul-2006 |
rwatson |
soreceive_generic(), and sopoll_generic(). Add new functions sosend(), soreceive(), and sopoll(), which are wrappers for pru_sosend, pru_soreceive, and pru_sopoll, and are now used universally by socket consumers rather than either directly invoking the old so*() functions or directly invoking the protocol switch method (about an even split prior to this commit).
This completes an architectural change that was begun in 1996 to permit protocols to provide substitute implementations, as now used by UDP. Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to perform these operations on sockets -- in particular, distributed file systems and socket system calls.
Architectural head nod: sam, gnn, wollman
|
#
160602 |
|
23-Jul-2006 |
rwatson |
Remove duplicate 'or'.
Submitted by: ru
|
#
160600 |
|
23-Jul-2006 |
rwatson |
Add additional comments to the top of the UNIX domain socket implementation providing some high level pointers regarding the implementation.
|
#
160590 |
|
23-Jul-2006 |
rwatson |
Add two new unpcb flags, UNP_BINDING and UNP_CONNECTING, which will be used to mark UNIX domain sockets as being in the process of binding or connecting. Use these to prevent simultaneous bind or connect operations by multiple threads or processes on the same socket at the same time, which closes race conditions present in the UNIX domain socket implementation since inception.
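The in-progress flag technique can be sketched in userspace C with a pthread mutex. Everything named here is hypothetical (`struct xsock`, `XUNP_BINDING`, `xsock_bind()`), and the specific error values are assumptions chosen for illustration rather than a statement of what the kernel returns.

```c
#include <errno.h>
#include <pthread.h>

#define	XUNP_BINDING	0x01	/* hypothetical; modeled on UNP_BINDING */

struct xsock {
	pthread_mutex_t	lock;
	int		flags;
	int		bound;
};

static int
xsock_bind(struct xsock *s)
{
	pthread_mutex_lock(&s->lock);
	if (s->bound) {
		pthread_mutex_unlock(&s->lock);
		return (EINVAL);	/* assumed errno: already bound */
	}
	if (s->flags & XUNP_BINDING) {
		pthread_mutex_unlock(&s->lock);
		return (EALREADY);	/* assumed errno: bind in progress */
	}
	s->flags |= XUNP_BINDING;	/* claim the operation */
	pthread_mutex_unlock(&s->lock);

	/* Work that may sleep (e.g. file system lookups) runs unlocked here. */

	pthread_mutex_lock(&s->lock);
	s->flags &= ~XUNP_BINDING;	/* release the claim */
	s->bound = 1;
	pthread_mutex_unlock(&s->lock);
	return (0);
}
```

The flag lets the lock be dropped across the slow part of the operation while still rejecting a second thread that arrives in the meantime.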
|
#
160589 |
|
23-Jul-2006 |
rwatson |
Merge unp_bind() into uipc_bind(), as it is called only from uipc_bind().
|
#
160588 |
|
23-Jul-2006 |
rwatson |
Since unp_attach() and unp_detach() are now called only from uipc_attach() and uipc_detach(), merge them into their calling functions.
|
#
160587 |
|
23-Jul-2006 |
rwatson |
Move various UNIX socket global variables and sysctls from the middle of the file to the top.
|
#
160584 |
|
22-Jul-2006 |
rwatson |
In uipc_send() and uipc_rcvd(), store unp->unp_conn pointer in unp2 while working with the second unpcb to make the code more clear.
|
#
160583 |
|
22-Jul-2006 |
rwatson |
Re-wrap and other minor formatting and punctuation fixes for UNIX domain socket comments.
|
#
160549 |
|
21-Jul-2006 |
rwatson |
Change semantics of socket close and detach. Add a new protocol switch function, pru_close, to notify protocols that the file descriptor or other consumer of a socket is closing the socket. pru_abort is now a notification of close also, and no longer detaches. pru_detach is no longer used to notify of close, and will be called during socket tear-down by sofree() when all references to a socket evaporate after an earlier call to abort or close the socket. This means detach is now an unconditional teardown of a socket, whereas previously sockets could persist after detach of the protocol retained a reference.
This facilitates sharing mutexes between layers of the network stack as the mutex is required during the checking and removal of references at the head of sofree(). With this change, pru_detach can now assume that the mutex will no longer be required by the socket layer after completion, whereas before this was not necessarily true.
Reviewed by: gnn
|
#
160278 |
|
11-Jul-2006 |
rwatson |
Reduce periods of simultaneous acquisition of various socket buffer locks and the unplock during uipc_rcvd() and uipc_send() by caching certain values from one structure while its locks are held, and applying them to a second structure while its locks are held. If done carefully, this should be correct, and will reduce the amount of work done with the global unp lock held.
Tested by: kris (earlier version)
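The cache-then-apply idea can be sketched with two independently locked structures. This is an assumption-laden illustration, not the uipc_send()/uipc_rcvd() code: `struct xbuf` and `xbuf_adjust()` are hypothetical, and the arithmetic is only a placeholder for whatever value is carried across.

```c
#include <pthread.h>

/* Hypothetical socket-buffer stand-in with its own lock. */
struct xbuf {
	pthread_mutex_t	lock;
	unsigned	cc;	/* bytes currently queued */
	unsigned	hiwat;	/* high water mark */
};

/*
 * Read a value from rcv under rcv's lock, drop that lock, then apply the
 * cached value to snd under snd's lock. The two locks are never held at
 * the same time, so no lock order between them is needed.
 */
static void
xbuf_adjust(struct xbuf *rcv, struct xbuf *snd, unsigned limit)
{
	unsigned cc;

	pthread_mutex_lock(&rcv->lock);
	cc = rcv->cc;			/* cache while rcv is locked */
	pthread_mutex_unlock(&rcv->lock);

	pthread_mutex_lock(&snd->lock);
	snd->hiwat = cc < limit ? limit - cc : 0;
	pthread_mutex_unlock(&snd->lock);
}
```

The trade-off is that the cached value may be slightly stale by the time it is applied, which is why the commit notes that this must be done carefully.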
|
#
159951 |
|
26-Jun-2006 |
rwatson |
Trim basically unused 'unp' in uipc_connect().
|
#
159672 |
|
16-Jun-2006 |
rwatson |
Remove unused (and ifdef'd) unp_abort() and unp_drain().
MFC after: 1 month
|
#
159575 |
|
13-Jun-2006 |
maxim |
o There are two methods to get a process credentials over the unix sockets:
1) A sender sends SCM_CREDS message to a receiver, struct cmsgcred;
2) A receiver sets LOCAL_CREDS socket option and gets sender credentials in control message, struct sockcred.
Both methods use the same control message type SCM_CREDS with the same control message level SOL_SOCKET, so they are indistinguishable for the receiver. A difference in struct cmsgcred and struct sockcred layouts may lead to unwanted effects.
Now for sockets with LOCAL_CREDS option remove all previous linked SCM_CREDS control messages and then add a control message with struct sockcred so the process specifically asked for the peer credentials by LOCAL_CREDS option always gets struct sockcred.
PR: kern/90800 Submitted by: Andrey Simonenko Regres. tests: tools/regression/sockets/unix_cmsg/ MFC after: 1 month
|
#
157999 |
|
24-Apr-2006 |
maxim |
Inherit LOCAL_CREDS option from listen socket for sockets returned by accept(2).
PR: kern/90644 Submitted by: Andrey Simonenko OK'ed by: mdodd Tested by: NetBSD regress/sys/kern/unfdpass/unfdpass.c MFC after: 1 month
|
#
157927 |
|
21-Apr-2006 |
ps |
Allow for nmbclusters and maxsockets to be increased via sysctl. An eventhandler is used to update all the various zones that depend on these values.
|
#
157370 |
|
01-Apr-2006 |
rwatson |
Change protocol switch method pru_detach() so that it returns void rather than an error. Detaches do not "fail", they either occur or the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying entirely on the protocol to decide whether it's time to free the socket or not using SS_PROTOREF. so_pcb is now entirely owned and managed by the protocol code. Likewise, no longer test so_pcb in other socket functions, such as soreceive(), which have no business digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach, this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and likewise in other protocols that don't permit a NULL so_pcb, reduce the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which will be in an upcoming commit. In their current state they may leak memory or panic.
MFC after: 3 months
|
#
157366 |
|
01-Apr-2006 |
rwatson |
Change protocol switch pru_abort() API so that it returns void rather than an int, as an error here is not meaningful. Modify soabort() to unconditionally free the socket on the return of pru_abort(), and modify most protocols to no longer conditionally free the socket, since the caller will do this.
This commit likely leaves parts of netinet and netinet6 in a situation where they may panic or leak memory, as they are not fully updated by this commit. This will be corrected shortly in followup commits to these components.
MFC after: 3 months
|
#
156807 |
|
17-Mar-2006 |
rwatson |
Modify UNIX domain sockets to guarantee, and assume, that so_pcb is always defined for an in-use socket. This allows us to eliminate countless tests of whether so_pcb is non-NULL, eliminating dozens of error cases. For now, retain the call to sotryfree() in the uipc_abort() path, but this will eventually move to soabort().
These new assumptions should be largely correct, and will become more so as the socket/pcb reference model is fixed. Removing the notion that so_pcb can be NULL is a critical step towards further fine-graining of the UNIX domain socket locking, as the so_pcb reference no longer needs to be protected using locks, instead it is a property of the socket life cycle.
|
#
155031 |
|
30-Jan-2006 |
jeff |
- Lock access to vrele() with VFS_LOCK_GIANT() rather than mtx_lock(&Giant).
Sponsored by: Isilon Systems, Inc.
|
#
154278 |
|
12-Jan-2006 |
rwatson |
XXX a comment in uipc_usrreq.c that requires updating.
|
#
153427 |
|
14-Dec-2005 |
mux |
Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to match the type of the variable they are exporting.
Spotted by: Thomas Hurst <tom@hur.st> MFC after: 3 days
|
#
152283 |
|
10-Nov-2005 |
rwatson |
Correct a number of serious and closely related bugs in the UNIX domain socket file descriptor garbage collection code, which is intended to detect and clear cycles of orphaned file descriptors that are "in-flight" in a socket when that socket is closed before they are received. The algorithm present was both run at poor times (resulting in recursion and reentrance), and also buggy in the presence of parallelism. In order to fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed, asynchronously schedule the garbage collector, rather than running it synchronously in the current context. This avoids lock order issues when the garbage collection code reenters the UNIX domain socket code, avoiding lock order reversals, deadlocks, etc. Run the code asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have entered a closing state (i.e., have f_count == 0), re-test the FDEFER flag, and decrement unp_defer. As file descriptors can now transition to a closed state, while the garbage collector is running, it is no longer the case that unp_defer will remain an accurate count of deferred sockets in the mark portion of the GC algorithm. Otherwise, the garbage collector will loop waiting for unp_defer to reach zero, which it will never do as it is skipping file descriptors that were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when modifying the unp_rights counter, or a read/write race is risked with other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in the garbage collector, this is not required as we are able to use the socket buffer receive lock to protect scanning the receive buffer for in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled file descriptors. This will be removed and is here temporarily for debugging purposes.
With these changes in place, the unp_passfd regression test now appears to be passed consistently on UP and SMP systems for extended runs, whereas before it hung quickly or panicked, depending on which bug was triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com> MFC after: 2 weeks
|
#
151897 |
|
31-Oct-2005 |
rwatson |
Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat.
- Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters.
- Disambiguate some collisions by adding subsystem prefixes to some memory types.
- Generally prefer lower case to upper case.
- If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases.
Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.
|
#
151888 |
|
30-Oct-2005 |
rwatson |
Push the assignment of a new or updated so_qlimit from solisten() following the protocol pru_listen() call to solisten_proto(), so that it occurs under the socket lock acquisition that also sets SO_ACCEPTCONN. This requires passing the new backlog parameter to the protocol, which also allows the protocol to be aware of changes in queue limit should it wish to do something about the new queue limit. This continues a move towards the socket layer acting as a library for the protocol.
Bump __FreeBSD_version due to a change in the in-kernel protocol interface. This change has been tested with IPv4 and UNIX domain sockets, but not other protocols.
|
#
150487 |
|
23-Sep-2005 |
rwatson |
Canonicalize the UNIX domain socket copyright layout: original holders before more recent holders.
MFC after: 3 days
|
#
145978 |
|
06-May-2005 |
cperciva |
Fix two issues which were missed in FreeBSD-SA-05:08.kmem.
Reported by: Uwe Doering
|
#
145492 |
|
24-Apr-2005 |
mdodd |
Add missing break.
Found by: marcus
|
#
145312 |
|
20-Apr-2005 |
mdodd |
Check sopt_level in uipc_ctloutput() and return early if it is non-zero. This prevents unintended consequences when an application calls things like setsockopt(x, SOL_SOCKET, SO_REUSEADDR, ...) on a Unix domain socket.
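The early-return check can be sketched as below. The types and constants are hypothetical (`struct xsockopt`, `XLOCAL_CREDS`, `x_ctloutput()`), and the error values are assumptions for illustration; only the shape of the check, rejecting any non-zero level before looking at the option name, comes from the commit message.

```c
#include <errno.h>

#define	XLOCAL_CREDS	1	/* hypothetical option name */

/* Hypothetical, trimmed-down stand-in for struct sockopt. */
struct xsockopt {
	int	level;
	int	name;
};

static int
x_ctloutput(const struct xsockopt *sopt)
{
	/* Options aimed at another level must not be interpreted here. */
	if (sopt->level != 0)
		return (EINVAL);	/* assumed error value */

	switch (sopt->name) {
	case XLOCAL_CREDS:
		return (0);
	default:
		return (EOPNOTSUPP);
	}
}
```

Without the level check, an option number from a different level that happens to collide with a protocol-level option name would be handled as if it were the protocol's own option.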
|
#
144978 |
|
12-Apr-2005 |
mdodd |
Implement unix(4) socket options LOCAL_CREDS and LOCAL_CONNWAIT.
- Add unp_addsockcred() (for LOCAL_CREDS).
- Add an argument to unp_connect2() to differentiate between PRU_CONNECT and PRU_CONNECT2 (for LOCAL_CONNWAIT).
Obtained from: NetBSD (with some changes)
|
#
142190 |
|
21-Feb-2005 |
rwatson |
In the current world order, solisten() implements the state transition of a socket from a regular socket to a listening socket able to accept new connections. As part of this state transition, solisten() calls into the protocol to update protocol-layer state. There were several bugs in this implementation that could result in a race wherein a TCP SYN received in the interval between the protocol state transition and the shortly following socket layer transition would result in a panic in the TCP code, as the socket would be in the TCPS_LISTEN state, but the socket would not have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to socket "library" routines called from the protocol. This permits the socket routines to be called while holding the protocol mutexes, preventing a race exposing the incomplete socket state transition to TCP after the TCP state transition has completed. The check for a socket layer state transition is performed by solisten_proto_check(), and the actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set, and over the protocol layer state transition, which is now possible as the socket lock is acquired by the protocol layer, rather than vice versa. This prevents additional state related races in the socket layer.
This permits the dual transition of socket layer and protocol layer state to occur while holding locks for both layers, making the two changes atomic with respect to one another. Similar changes are likely required elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc> Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net> Philosophical head nod: gnn
|
#
142174 |
|
21-Feb-2005 |
rwatson |
When aborting a UNIX domain socket bind() because VOP_CREATE() failed, make sure to call vn_finished_write(mp) before returning.
MFC after: 3 days
|
#
142155 |
|
20-Feb-2005 |
rwatson |
style(9)-ize function headers, remove use of 'register'.
MFC after: 3 days
|
#
142139 |
|
20-Feb-2005 |
rwatson |
In unp_attach(), allow uma_zalloc to zero the new unpcb rather than explicitly using bzero().
Update copyright.
MFC after: 3 days
|
#
142118 |
|
20-Feb-2005 |
rwatson |
Move assignment of UNIX domain socket pcb during unp_attach() outside of the global UNIX domain socket mutex: no protection is needed that early in the setup of the UNIX domain socket and socket structures.
MFC after: 3 days
|
#
139804 |
|
06-Jan-2005 |
imp |
/* -> /*- for copyright notices, minor format tweaks as necessary
|
#
139218 |
|
22-Dec-2004 |
rwatson |
Remove temporary debugging printf that was used to detect the presence of a race that had previously caused a panic in order to determine if the fix was for the right problem. It was.
MFC after: 2 weeks
|
#
139211 |
|
22-Dec-2004 |
alc |
Add send buffer locking to uipc_send(). Without this locking a race can occur between a reader and a writer that results in a panic upon close, e.g., "panic: sbflush_locked: cc 4 || mb 0xffffff0052afa400 || mbcnt 0"
Reviewed by: rwatson@ MFC after: 2 weeks
|
#
138261 |
|
01-Dec-2004 |
phk |
"nfiles" is a bad name for a global variable. Call it "openfiles" instead as this is more correct and matches the sysctl variable.
|
#
137386 |
|
08-Nov-2004 |
phk |
Initialize struct pr_userreqs in new/sparse style and fill in common default elements in net_init_domain().
This makes it possible to grep these structures and see any bogosities.
|
#
136682 |
|
18-Oct-2004 |
rwatson |
Push acquisition of the accept mutex out of sofree() into the caller (sorele()/sotryfree()):
- This permits the caller to acquire the accept mutex before the socket mutex, avoiding sofree() having to drop the socket mutex and re-order, which could lead to races permitting more than one thread to enter sofree() after a socket is ready to be free'd.
- This also covers clearing of the so_pcb weak socket reference from the protocol to the socket, preventing races in clearing and evaluation of the reference such that sofree() might be called more than once on the same socket.
This appears to close a race I was able to easily trigger by repeatedly opening and resetting TCP connections to a host, in which the tcp_close() code called as a result of the RST raced with the close() of the accepted socket in the user process resulting in simultaneous attempts to de-allocate the same socket. The new locking increases the overhead for operations that may potentially free the socket, so we will want to revise the synchronization strategy here as we normalize the reference counting model for sockets. The use of the accept mutex in freeing of sockets that are not listen sockets is primarily motivated by the potential need to remove the socket from the incomplete connection queue on its parent (listen) socket, so cleaning up the reference model here may allow us to substantially weaken the synchronization requirements.
RELENG_5_3 candidate.
MFC after: 3 days Reviewed by: dwhite Discussed with: gnn, dwhite, green Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de> Reported by: Vlad <marchenko at gmail dot com>
|
#
134310 |
|
25-Aug-2004 |
rwatson |
Don't hold the UNIX domain socket subsystem lock over the body of the UNIX domain socket garbage collection implementation, as that risks holding the mutex over potentially sleeping operations (as well as introducing some nasty lock order issues, etc). unp_gc() will hold the lock long enough to do necessary deferral checks and set that it's running, but then release it until it needs to reset the gc state.
RELENG_5 candidate.
Discussed with: alfred
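The "set running under the lock, drop the lock for the work, retake it to reset" shape described above can be sketched in userland C. This is only an illustration under stated assumptions: every demo_* name is hypothetical, the "lock" is a flag that merely records ownership so assertions can check it, and the skipped counter exists only so the behaviour can be observed. It is not the unp_gc() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the subsystem mutex: it only records ownership. */
struct demo_lock {
	bool held;
};

void demo_lock_acquire(struct demo_lock *l) { assert(!l->held); l->held = true; }
void demo_lock_release(struct demo_lock *l) { assert(l->held); l->held = false; }

struct demo_gc {
	struct demo_lock lock;	/* protects running, passes, skipped */
	bool running;		/* a collection pass is in progress */
	int passes;		/* completed passes */
	int skipped;		/* calls that found a pass already running */
};

/*
 * Run one pass. The lock is held only to test and set 'running' and later
 * to clear it; the (possibly sleeping) scan runs with the lock dropped.
 * Returns true if this call did the work, false if it backed off.
 */
bool demo_gc_run(struct demo_gc *gc, void (*scan)(struct demo_gc *))
{
	demo_lock_acquire(&gc->lock);
	if (gc->running) {
		gc->skipped++;
		demo_lock_release(&gc->lock);
		return false;
	}
	gc->running = true;
	demo_lock_release(&gc->lock);

	assert(!gc->lock.held);		/* nothing held over the scan */
	if (scan != NULL)
		scan(gc);

	demo_lock_acquire(&gc->lock);
	gc->running = false;
	gc->passes++;
	demo_lock_release(&gc->lock);
	return true;
}

/* Scan callback that tries to start a second pass while one is running. */
void demo_gc_reenter(struct demo_gc *gc)
{
	(void)demo_gc_run(gc, NULL);
}
```

Passing demo_gc_reenter as the scan callback emulates a second caller arriving mid-pass: it can take the lock because the lock is not held across the scan, sees the running flag, and backs off.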
|
#
133995 |
|
18-Aug-2004 |
rwatson |
Add UNP_UNLOCK_ASSERT() to assert that the UNIX domain socket subsystem lock is not held.
Rather than annotating that the lock is released after calls to unp_detach() with a comment, annotate with an assertion.
Assert that the UNIX domain socket subsystem lock is not held when unp_externalize() and unp_internalize() are called.
|
#
133803 |
|
16-Aug-2004 |
rwatson |
Always acquire the UNIX domain socket subsystem lock (UNP lock) before dereferencing sotounpcb() and checking its value, as so_pcb is protected by protocol locking, not subsystem locking. This prevents races during close() by one thread and use of the socket in another.
unp_bind() now asserts the UNP lock, and uipc_bind() now acquires the lock around calls to unp_bind().
|
#
133792 |
|
15-Aug-2004 |
rwatson |
Annotate the current UNIX domain socket locking strategies, order, strengths, and weaknesses in a comment. Assert a copyright over the changes made as part of the locking work.
|
#
133709 |
|
14-Aug-2004 |
rwatson |
After completing a name lookup for a target UNIX domain socket to connect to, re-check that the local UNIX domain socket hasn't been closed while we slept, and if so, return EINVAL. This affects the system running both with and without Giant over the network stack, and recent ULE changes appear to cause it to trigger more frequently than previously under load. While here, improve catching of possibly closed UNIX domain sockets in one or two additional circumstances. I have a much larger set of related changes in Perforce, but they require more testing before they can be merged.
One debugging printf is left in place to indicate when such a race takes place: this is typically triggered by a buggy application that simultaneously connect()'s and close()'s a UNIX domain socket file descriptor. I'll remove this at some point in the future, but am interested in seeing how frequently this is reported. In the case of Martin's reported problem, it appears to be a result of a non-thread safe syslog() implementation in the C library, which does not synchronize access to its logging file descriptor.
Reported by: mbr
|
#
132645 |
|
25-Jul-2004 |
rwatson |
In uipc_connect(), assert that the passed thread is curthread, and pass td into unp_connect() instead of reading curthread.
|
#
132325 |
|
17-Jul-2004 |
rwatson |
Drop Giant and acquire the UNIX domain socket subsystem lock a bit earlier in unp_connect() so that vp->v_socket can't change between our copying its value to a local variable and later use of that variable. This may have been responsible for a panic during shutdown that I experienced, where a listen socket was being closed by rpcbind at the same time as a new connection was being made to rpcbind by mountd.
|
#
131439 |
|
02-Jul-2004 |
alfred |
We allocate an array of pointers to the global file table while not holding the filelist_lock. This means the filelist can change size while allocating. Detect this race and retry the allocation.
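A minimal userland sketch of the detect-and-retry idea, assuming a list whose size is read under a lock while the allocation itself happens with the lock dropped. All demo_* names are hypothetical, the lock is a flag checked by assertions, and the racer hook exists only to emulate a concurrent writer in a single thread. It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct demo_list {
	bool locked;		/* stand-in for the list lock */
	size_t count;		/* number of entries, protected by the lock */
	const int *items;	/* entries, protected by the lock */
};

void demo_list_lock(struct demo_list *l) { assert(!l->locked); l->locked = true; }
void demo_list_unlock(struct demo_list *l) { assert(l->locked); l->locked = false; }

/* Emulated concurrent writer: replaces the list with a larger one. */
const int demo_grown[] = { 1, 2, 3, 4 };

void demo_list_grow(struct demo_list *l)
{
	demo_list_lock(l);
	l->items = demo_grown;
	l->count = sizeof(demo_grown) / sizeof(demo_grown[0]);
	demo_list_unlock(l);
}

/*
 * Copy the list into a newly allocated array. The array is sized from a
 * count read before the allocation; once the lock is retaken the count is
 * checked again and, if the list grew in the meantime, the allocation is
 * thrown away and the whole sequence is retried. 'racer' (may be NULL) is
 * called once while the lock is dropped.
 */
int *demo_list_snapshot(struct demo_list *l, void (*racer)(struct demo_list *),
    size_t *outn, int *retries)
{
	*retries = 0;
	for (;;) {
		demo_list_lock(l);
		size_t guess = l->count;
		demo_list_unlock(l);

		/* Allocation may block, so it is done without the lock. */
		int *snap = malloc((guess > 0 ? guess : 1) * sizeof(*snap));
		if (snap == NULL)
			return NULL;

		if (racer != NULL) {
			racer(l);
			racer = NULL;
		}

		demo_list_lock(l);
		if (l->count > guess) {
			/* Lost the race: the array is too small. */
			demo_list_unlock(l);
			free(snap);
			(*retries)++;
			continue;
		}
		if (l->count > 0)
			memcpy(snap, l->items, l->count * sizeof(*snap));
		*outn = l->count;
		demo_list_unlock(l);
		return snap;
	}
}
```

The caller owns the returned array and frees it. A real implementation would likely add some slack to the guess so that small changes in size do not force a retry.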
|
#
131170 |
|
27-Jun-2004 |
rwatson |
Acquire the socket buffer lock when calling unp_scan() on so->so_rcv.sb_mb to prevent the mbuf chain from changing during the scan.
|
#
131151 |
|
26-Jun-2004 |
rwatson |
Reduce the number of unnecessary unlock-relocks on socket buffer mutexes associated with performing a wakeup on the socket buffer:
- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly acquire the socket buffer lock and use the _locked() variants of both calls. Note that the _locked() sowakeup() versions unlock the mutex on return. This is done in uipc_send(), divert_packet(), mroute socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().
- When the socket buffer lock is dropped before a sowakeup(), remove the explicit unlock and use the _locked() sowakeup() variant. This is done in soisdisconnecting(), soisdisconnected() when setting the can't send/ receive flags and dropping data, and in uipc_rcvd() when adjusting back-pressure on the sockets.
For UNIX domain sockets running mpsafe with a contention-intensive SMP mysql benchmark, this results in a 1.6% query rate improvement due to reduced mutex costs.
|
#
131109 |
|
25-Jun-2004 |
rwatson |
Release UNIX domain socket subsystem lock earlier -- don't need to hold it over free of unp_addr if we've already removed all references to unp.
|
#
130831 |
|
20-Jun-2004 |
rwatson |
Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry. Move the call to KNOTE higher in sowakeup() so that it is made with the socket buffer lock held for consistency with other calls. Release the socket buffer lock prior to calling into pgsigio(), so_upcall(), or aio_swake(). Locking for this event management will need revisiting in the future, but this model avoids lock order reversals when upcalls into other subsystems result in socket/socket buffer operations. Assert that the socket buffer lock is not held at the end of the function.
- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now have _locked versions which assert the socket buffer lock on entry. If a wakeup is required by sb_notify(), invoke sowakeup(); otherwise, unconditionally release the socket buffer lock. This results in the socket buffer lock being released whether a wakeup is required or not.
- Break out socantsendmore() into socantsendmore_locked() that asserts the socket buffer lock. socantsendmore() unconditionally locks the socket buffer before calling socantsendmore_locked(). Note that both functions return with the socket buffer unlocked as socantsendmore_locked() calls sowwakeup_locked() which has the same properties. Assert that the socket buffer is unlocked on return.
- Break out socantrcvmore() into socantrcvmore_locked() that asserts the socket buffer lock. socantrcvmore() unconditionally locks the socket buffer before calling socantrcvmore_locked(). Note that both functions return with the socket buffer unlocked as socantrcvmore_locked() calls sorwakeup_locked() which has similar properties. Assert that the socket buffer is unlocked on return.
- Break out sbrelease() into a sbrelease_locked() that asserts the socket buffer lock. sbrelease() unconditionally locks the socket buffer before calling sbrelease_locked(). sbrelease_locked() now invokes sbflush_locked() instead of sbflush().
- Assert the socket buffer lock in socket buffer sanity check functions sblastrecordchk(), sblastmbufchk().
- Assert the socket buffer lock in SBLINKRECORD().
- Break out various sbappend() functions into sbappend_locked() (and variations on that name) that assert the socket buffer lock. The !_locked() variations unconditionally lock the socket buffer before calling their _locked counterparts. Internally, make sure to call _locked() support routines, etc, if already holding the socket buffer lock.
- Break out sbinsertoob() into sbinsertoob_locked() that asserts the socket buffer lock. sbinsertoob() unconditionally locks the socket buffer before calling sbinsertoob_locked().
- Break out sbflush() into sbflush_locked() that asserts the socket buffer lock. sbflush() unconditionally locks the socket buffer before calling sbflush_locked(). Update panic strings for new function names.
- Break out sbdrop() into sbdrop_locked() that asserts the socket buffer lock. sbdrop() unconditionally locks the socket buffer before calling sbdrop_locked().
- Break out sbdroprecord() into sbdroprecord_locked() that asserts the socket buffer lock. sbdroprecord() unconditionally locks the socket buffer before calling sbdroprecord_locked().
- sofree() now calls socantsendmore_locked() and re-acquires the socket buffer lock on return. It also now calls sbrelease_locked().
- sorflush() now calls socantrcvmore_locked() and re-acquires the socket buffer lock on return. Clean up/mess up other behavior in sorflush() relating to the temporary stack copy of the socket buffer used with dom_dispose by more properly initializing the temporary copy, and selectively bzeroing/copying more carefully to prevent WITNESS from getting confused by improperly initialized mutexes. Annotate why that's necessary, or at least, needed.
- soisconnected() now calls sbdrop_locked() before unlocking the socket buffer to avoid locking overhead.
Some parts of this change were:
Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
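The foo()/foo_locked() split used throughout this change can be sketched as follows. This is a userland illustration only: the demo_* names are hypothetical, the "mutex" is a flag that records ownership, and assert() stands in for the kernel's lock assertions.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_sockbuf {
	bool locked;	/* stand-in for the socket buffer mutex */
	int cc;		/* bytes queued in the buffer */
};

void demo_sb_lock(struct demo_sockbuf *sb) { assert(!sb->locked); sb->locked = true; }
void demo_sb_unlock(struct demo_sockbuf *sb) { assert(sb->locked); sb->locked = false; }

/* _locked variant: the caller must already hold the lock; only assert it. */
void demo_sbdrop_locked(struct demo_sockbuf *sb, int len)
{
	assert(sb->locked);
	sb->cc -= (len < sb->cc) ? len : sb->cc;
}

/* Plain variant: unconditionally lock, call the _locked worker, unlock. */
void demo_sbdrop(struct demo_sockbuf *sb, int len)
{
	demo_sb_lock(sb);
	demo_sbdrop_locked(sb, len);
	demo_sb_unlock(sb);
}

/*
 * A caller that already holds the lock uses the _locked workers directly,
 * so two operations cost one lock/unlock pair instead of two.
 */
void demo_drop_twice(struct demo_sockbuf *sb, int a, int b)
{
	demo_sb_lock(sb);
	demo_sbdrop_locked(sb, a);
	demo_sbdrop_locked(sb, b);
	demo_sb_unlock(sb);
}
```

Keeping the assertion in the worker means a caller that forgets the lock fails immediately in a debug build rather than corrupting the buffer silently.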
|
#
130820 |
|
20-Jun-2004 |
rwatson |
In uipc_rcvd(), lock the socket buffers at either end of the UNIX domain socket when updating fields at both ends.
Submitted by: sam Sponsored by: FreeBSD Foundation
|
#
130817 |
|
20-Jun-2004 |
rwatson |
Hold SOCK_LOCK(so) when frobbing so_state when disconnecting a connected UNIX domain datagram socket.
|
#
130640 |
|
17-Jun-2004 |
phk |
Second half of the dev_t cleanup.
The big lines are:
	NODEV -> NULL
	NOUDEV -> NODEV
	udev_t -> dev_t
	udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel space struct cdev etc.
|
#
130480 |
|
14-Jun-2004 |
rwatson |
The socket field so_state is used to hold a variety of socket related flags relating to several aspects of socket functionality. This change breaks out several bits relating to send and receive operation into a new per-socket buffer field, sb_state, in order to facilitate locking. This is required because, in order to provide more granular locking of sockets, different state fields have different locking properties. The following fields are moved to sb_state:
SS_CANTRCVMORE (so_state) SS_CANTSENDMORE (so_state) SS_RCVATMARK (so_state)
Rename respectively to:
SBS_CANTRCVMORE (so_rcv.sb_state) SBS_CANTSENDMORE (so_snd.sb_state) SBS_RCVATMARK (so_rcv.sb_state)
This facilitates locking by isolating fields to be located with other identically locked fields, and permits greater granularity in socket locking by avoiding storing fields with different locking semantics in the same short (avoiding locking conflicts). In the future, we may wish to coalesce sb_state and sb_flags; for the time being I leave them separate and there is no additional memory overhead due to the packing/alignment of shorts in the socket buffer structure.
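As a rough sketch of why the split helps: once the send and receive bits live in their own per-buffer short, each can be updated under that buffer's lock without touching so_state. The structures and flag values below are illustrative assumptions (hypothetical demo_* names and values), not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only. */
#define DEMO_SS_ISCONNECTED	0x0002	/* stays in so_state */
#define DEMO_SBS_CANTSENDMORE	0x0010	/* lives in so_snd.sb_state */
#define DEMO_SBS_CANTRCVMORE	0x0020	/* lives in so_rcv.sb_state */

struct demo_sockbuf {
	bool locked;	/* stand-in for the per-buffer mutex */
	short sb_state;	/* bits protected by this buffer's lock */
};

struct demo_socket {
	short so_state;			/* protected by the socket lock */
	struct demo_sockbuf so_rcv;
	struct demo_sockbuf so_snd;
};

/* Set a state bit while holding only the lock of the buffer it lives in. */
void demo_sb_set(struct demo_sockbuf *sb, short flag)
{
	assert(!sb->locked);
	sb->locked = true;
	sb->sb_state |= flag;
	sb->locked = false;
}

void demo_socantsendmore(struct demo_socket *so)
{
	demo_sb_set(&so->so_snd, DEMO_SBS_CANTSENDMORE);
}

void demo_socantrcvmore(struct demo_socket *so)
{
	demo_sb_set(&so->so_rcv, DEMO_SBS_CANTRCVMORE);
}
```

Because the three words are distinct objects, a read-modify-write on one cannot clobber a concurrent update of another, which is the locking conflict the message refers to.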
|
#
130398 |
|
13-Jun-2004 |
rwatson |
Socket MAC labels so_label and so_peerlabel are now protected by SOCK_LOCK(so):
- Hold socket lock over calls to MAC entry points reading or manipulating socket labels.
- Assert socket lock in MAC entry point implementations.
- When externalizing the socket label, first make a thread-local copy while holding the socket lock, then release the socket lock to externalize to userspace.
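The last point above (copy under the lock, then externalize with the lock released) might look like this in outline. This is a userland sketch with hypothetical names: the label is reduced to a single integer, snprintf() stands in for the copy to userspace, and the lock is a flag checked by assertions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct demo_label {
	int level;	/* placeholder for real label contents */
};

struct demo_socket {
	bool locked;			/* stand-in for the socket lock */
	struct demo_label so_label;	/* protected by the socket lock */
};

void demo_sock_lock(struct demo_socket *so) { assert(!so->locked); so->locked = true; }
void demo_sock_unlock(struct demo_socket *so) { assert(so->locked); so->locked = false; }

/*
 * Take a private copy of the label while holding the lock, release the
 * lock, and only then do the slow formatting step on the copy. Returns
 * what snprintf() returns.
 */
int demo_label_externalize(struct demo_socket *so, char *out, size_t outlen)
{
	struct demo_label copy;

	demo_sock_lock(so);
	copy = so->so_label;
	demo_sock_unlock(so);

	assert(!so->locked);	/* nothing held while producing output */
	return snprintf(out, outlen, "level/%d", copy.level);
}
```

The copy costs a few bytes of stack but means the step that may fault or sleep never runs with the socket lock held.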
|
#
130387 |
|
12-Jun-2004 |
rwatson |
Extend coverage of SOCK_LOCK(so) to include so_count, the socket reference count:
- Assert SOCK_LOCK(so) macros that directly manipulate so_count: soref(), sorele().
- Assert SOCK_LOCK(so) in macros/functions that rely on the state of so_count: sofree(), sotryfree().
- Acquire SOCK_LOCK(so) before calling these functions or macros in various contexts in the stack, both at the socket and protocol layers.
- In some cases, perform soisdisconnected() before sotryfree(), as this could result in frobbing of a non-present socket if sotryfree() actually frees the socket.
- Note that sofree()/sotryfree() will release the socket lock even if they don't free the socket.
Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS
|
#
130316 |
|
10-Jun-2004 |
rwatson |
Introduce a subsystem lock around UNIX domain sockets in order to protect global and allocated variables. This strategy is derived from work originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code, drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard to unp_mtx when garbage collecting passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to the unpcb zone not returning memory for reuse by other subsystems (consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of both a global lock and per-unpcb locks. However, in practice, the global lock covered all accesses, so I have simplified out the unpcb locks in the interest of getting this merged faster (reducing the overhead but not sacrificing granularity in most cases). We will want to explore possibilities for improving lock granularity in this code in the future.
Submitted by: sam Sponsored by: FreeBSD Foundation Obtained from: BSD/OS 5 snapshot provided by BSDi
|
#
130050 |
|
04-Jun-2004 |
rwatson |
Mark sun_noname as const since it's immutable. Update definitions of functions that potentially accept &sun_noname (sbappendaddr(), et al) to accept a const sockaddr pointer.
|
#
127911 |
|
05-Apr-2004 |
imp |
Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999.
Approved by: core
|
#
127652 |
|
30-Mar-2004 |
rwatson |
Export uipc_connect2() from uipc_usrreq.c instead of unp_connect2(), and consume that interface in portalfs and fifofs instead. In the new world order, unp_connect2() assumes that the unpcb mutex is held, whereas uipc_connect2() validates that the passed sockets are UNIX domain sockets, then grabs the mutex.
NB: the portalfs and fifofs code gets down and dirty with UNIX domain sockets. Maybe this is a bad thing.
|
#
127599 |
|
30-Mar-2004 |
rwatson |
Prefer NULL to 0 when testing and assigning pointer values.
|
#
126425 |
|
01-Mar-2004 |
rwatson |
Rename dup_sockaddr() to sodupsockaddr() for consistency with other functions in kern_socket.c.
Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT in from the caller context rather than "1" or "0".
Correct mflags pass into mac_init_socket() from previous commit to not include M_ZERO.
Submitted by: sam
|
#
126103 |
|
21-Feb-2004 |
cperciva |
If we're going to panic(), do it before dereferencing a NULL pointer.
Reported by: "Ted Unangst" <tedu@coverity.com> Approved by: rwatson (mentor)
|
#
124602 |
|
16-Jan-2004 |
des |
Restore correct semantics for F_DUPFD fcntl. This should fix the errors people have been getting with configure scripts.
|
#
124548 |
|
15-Jan-2004 |
des |
New file descriptor allocation code, derived from similar code introduced in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated file descriptors which is used to locate available descriptors when a new one is needed. It also moves the task of growing the file descriptor table out of fdalloc(), reducing complexity in both fdalloc() and do_dup().
Debts of gratitude are owed to tjr@ (who provided the original patch on which this work is based), grog@ (for the gdb(4) man page) and rwatson@ (for assistance with pxeboot(8)).
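To illustrate the idea of locating a free descriptor through a bitmap, here is a small fixed-size sketch. It assumes a hypothetical demo_fdmap with a compile-time limit and leaves out table growth entirely (which the change above moves out of fdalloc()). It is not the kernel's allocator.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_NFD	128
#define DEMO_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS	((DEMO_NFD + DEMO_BITS - 1) / DEMO_BITS)

/* One bit per descriptor; a set bit means "in use". */
struct demo_fdmap {
	unsigned long map[DEMO_WORDS];
};

/*
 * Return the lowest free descriptor >= minfd and mark it used, or -1 if
 * none is available. Whole words with no free bit are skipped at once.
 */
int demo_fdalloc(struct demo_fdmap *m, int minfd)
{
	if (minfd < 0)
		minfd = 0;
	size_t first = (size_t)minfd / DEMO_BITS;

	for (size_t w = first; w < DEMO_WORDS; w++) {
		unsigned long free_bits = ~m->map[w];

		/* In the first word, ignore descriptors below minfd. */
		if (w == first)
			free_bits &= ~0UL << ((size_t)minfd % DEMO_BITS);
		if (free_bits == 0)
			continue;

		size_t bit = 0;
		while ((free_bits & (1UL << bit)) == 0)
			bit++;

		size_t fd = w * DEMO_BITS + bit;
		if (fd >= DEMO_NFD)
			return -1;
		m->map[w] |= 1UL << bit;
		return (int)fd;
	}
	return -1;
}

/* Release a descriptor so that a later allocation can reuse it. */
void demo_fdfree(struct demo_fdmap *m, int fd)
{
	assert(fd >= 0 && fd < DEMO_NFD);
	m->map[(size_t)fd / DEMO_BITS] &= ~(1UL << ((size_t)fd % DEMO_BITS));
}
```

The minfd argument is what a dup-style operation with a lower bound needs: the search starts at that descriptor rather than at zero.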
|
#
124392 |
|
11-Jan-2004 |
des |
Mechanical whitespace cleanup; parenthesize return values; other minor style nits.
|
#
122875 |
|
17-Nov-2003 |
rwatson |
Introduce a MAC label reference in 'struct inpcb', which caches the MAC label referenced from 'struct socket' in the IPv4 and IPv6-based protocols. This permits MAC labels to be checked during network delivery operations without dereferencing inp->inp_socket to get to so->so_label, which will eventually avoid our having to grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the MAC Framework, along with the normal circus of entry points: initialization, creation from socket, destruction, as well as a delivery access control check.
For most policies, the inpcb label will simply be a cache of the socket label, so a new protocol switch method is introduced, pr_sosetlabel() to notify protocols that the socket layer label has been updated so that the cache can be updated while holding appropriate locks. Most protocols implement this using pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use the worker function in_pcbsosetlabel(), which calls into the MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub policy, and test policy.
Reviewed by: sam, bms Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
|
#
116182 |
|
10-Jun-2003 |
obrien |
Use __FBSDID().
|
#
112516 |
|
23-Mar-2003 |
cognet |
s/discriptors/descriptors/
|
#
111119 |
|
19-Feb-2003 |
imp |
Back out M_* changes, per decision of the TRB.
Approved by: trb
|
#
110908 |
|
15-Feb-2003 |
alfred |
Do not allow kqueues to be passed via unix domain sockets.
|
#
110430 |
|
05-Feb-2003 |
hsu |
Remove vestiges of no longer needed unp_rvnode field.
Approved by: phk (who originally added it in rev 1.8 of unpcb.h)
|
#
110234 |
|
02-Feb-2003 |
alfred |
Catch more uses of MIN().
|
#
109809 |
|
24-Jan-2003 |
hsu |
Remove extraneous FILEDESC_LOCKs around atomic reads.
Reviewed by: jhb
|
#
109698 |
|
22-Jan-2003 |
ume |
Added comment why this workaround is required.
Suggested by: sam MFC after: 1 week
|
#
109681 |
|
22-Jan-2003 |
ume |
getpeername() returned with no error but didn't fill struct sockaddr correctly for PF_LOCAL. It seems that the test always fails, so the sockaddr was never filled in. So, I added an else clause as a workaround. I doubt it is the right fix; however, it is better than nothing. I found that NetBSD has the same potential problem but, fortunately, NetBSD has an equivalent else clause.
MFC after: 1 week
|
#
109623 |
|
21-Jan-2003 |
alfred |
Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
|
#
109153 |
|
12-Jan-2003 |
dillon |
Bow to the whining masses and change a union back into void *. Retain removal of unnecessary casts and throw in some minor cleanups to see if anyone complains, just for the hell of it.
|
#
109123 |
|
11-Jan-2003 |
dillon |
Change struct file f_data to un_data, a union of the correct struct pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't think it will be necessary. There are no operational changes in this commit.
|
#
108267 |
|
25-Dec-2002 |
hsu |
Ensure that the made-up inode number for a Unix domain socket is persistent.
|
#
106096 |
|
28-Oct-2002 |
rwatson |
Trim extraneous #else and #endif MAC comments per style(9).
|
#
105332 |
|
17-Oct-2002 |
robert |
- Allocate only enough space for a temporary buffer to hold the path including the terminating NUL character from `struct sockaddr_un' rather than SOCK_MAXADDRLEN bytes.
- Use strlcpy() instead of strncpy() to copy strings.
|
#
101126 |
|
31-Jul-2002 |
rwatson |
Introduce support for Mandatory Access Control and extensible kernel access control.
Authorize the creation of UNIX domain sockets in the file system namespace via an appropriate invocation of a MAC framework entry point.
Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
|
#
101125 |
|
31-Jul-2002 |
rwatson |
When invoking NDINIT() in preparation for CREATE, set SAVENAME since we'll use nd.ni_cnp later.
Submitted by: green Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
|
#
101013 |
|
31-Jul-2002 |
rwatson |
Introduce support for Mandatory Access Control and extensible kernel access control.
Invoke the necessary MAC entry points to maintain labels on sockets. In particular, invoke entry points during socket allocation and destruction, as well as creation by a process or during an accept-scenario (sonewconn). For UNIX domain sockets, also assign a peer label. As the socket code isn't locked down yet, locking interactions are not yet clear. Various protocol stack socket operations (such as peer label assignment for IPv4) will follow.
Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
|
#
98994 |
|
28-Jun-2002 |
alfred |
nuke caddr_t.
|
#
97658 |
|
31-May-2002 |
tanimura |
Back out my last commit of locking down a socket, it conflicts with hsu's work.
Requested by: hsu
|
#
96972 |
|
20-May-2002 |
tanimura |
Lock down a socket, milestone 1.
o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a socket buffer. The mutex in the receive buffer also protects the data in struct socket.
o Determine the lock strategy for each member in struct socket.
o Lock down the following members:
- so_count - so_options - so_linger - so_state
o Remove *_locked() socket APIs. Make the following socket APIs touching the members above now require a locked socket:
- sodisconnect() - soisconnected() - soisconnecting() - soisdisconnected() - soisdisconnecting() - sofree() - soref() - sorele() - sorwakeup() - sotryfree() - sowakeup() - sowwakeup()
Reviewed by: alfred
|
#
95759 |
|
29-Apr-2002 |
tanimura |
Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.
Requested by: bde
Since locking sigio_lock is usually followed by calling pgsigio(), move the declaration of sigio_lock and the definitions of SIGIO_*() to sys/signalvar.h.
While I am here, sort include files alphabetically, where possible.
|
#
93076 |
|
24-Mar-2002 |
bde |
Fixed some style bugs in the removal of __P(()). The main ones were not removing tabs before "__P((", and not outdenting continuation lines to preserve non-KNF lining up of code with parentheses. Switch to KNF formatting and/or rewrap the whole prototype in some cases.
|
#
92759 |
|
20-Mar-2002 |
jeff |
Add calls to uma_zone_set_max() to restore previously enforced limits.
|
#
92752 |
|
20-Mar-2002 |
jeff |
Remove references to vm_zone.h and switch over to the new uma API.
|
#
92723 |
|
19-Mar-2002 |
alfred |
Remove __P.
|
#
92654 |
|
19-Mar-2002 |
jeff |
This is the first part of the new kernel memory allocator. This replaces malloc(9) and vm_zone with a slab like allocator.
Reviewed by: arch@
|
#
91418 |
|
27-Feb-2002 |
jhb |
- Change unp_listen() to accept a thread rather than a proc as its second argument.
- Use td_ucred in unp_listen() instead of p_ucred.
|
#
91406 |
|
27-Feb-2002 |
jhb |
Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
|
#
91354 |
|
27-Feb-2002 |
dd |
Introduce a version field to `struct xucred' in place of one of the spares (the size of the field was changed from u_short to u_int to reflect what it really ends up being). Accordingly, change users of xucred to set and check this field as appropriate. In the kernel, this is being done inside the new cru2x() routine which takes a `struct ucred' and fills out a `struct xucred' according to the former. This also has the pleasant side effect of removing some duplicate code.
Reviewed by: rwatson
|
#
91210 |
|
24-Feb-2002 |
iedowse |
Sockets passed into uipc_abort() have been allocated by sonewconn() but never accept'ed, so they must be destroyed. Originally, unp_drop() detected this situation by checking if so->so_head is non-NULL. However, since revision 1.54 of uipc_socket.c (Feb 1999), so->so_head is set to NULL before calling soabort(), so any unix-domain sockets waiting to be accept'ed are leaked if the server socket is closed.
Resolve this by moving the socket destruction code into uipc_abort() itself, and making it unconditional (the other caller of unp_drop() never needs the socket to be destroyed). Use unp_detach() to avoid the original code duplication when destroying the socket.
PR: kern/17895 Reviewed by: dwmalone (an earlier version of the patch) MFC after: 1 week
|
#
89371 |
|
14-Jan-2002 |
alfred |
Remove a bogus FILEDESC_UNLOCK.
Submitted by: tanimura
|
#
89306 |
|
13-Jan-2002 |
alfred |
SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.
I've polished it quite a bit reducing the need for locking and adapting it for KSE.
Locks:
1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked.
1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex.
1 sx lock for the global filelist.
struct file * fhold(struct file *fp); /* increments reference count on a file */
struct file * fhold_locked(struct file *fp); /* like fhold but expects file to be locked */
struct file * ffind_hold(struct thread *, int fd); /* finds the struct file in thread, adds one reference and returns it unlocked */
struct file * ffind_lock(struct thread *, int fd); /* ffind_hold, but returns file locked */
I still have to smp-safe the fget cruft, I'll get to that asap.
|
#
87821 |
|
13-Dec-2001 |
rwatson |
o Back out portions of 1.50 and 1.47, eliminating sonewconn3() and always deriving the credential for a newly accepted connection from the listen socket. Previously, the selection of the credential depended on the protocol: UNIX domain sockets would use the connecting process's credential, and protocols supporting a creation of the socket before the receiving end called accept() would use the listening socket. After this change, it is always the listening credential.
Reviewed by: green
|
#
86487 |
|
17-Nov-2001 |
dillon |
Give struct socket structures a ref counting interface similar to vnodes. This will hopefully serve as a base from which we can expand the MP code. We currently do not attempt to obtain any mutex or SX locks, but the door is open to add them when we nail down exactly how that part of it is going to work.
|
#
86183 |
|
08-Nov-2001 |
rwatson |
o Replace reference to 'struct proc' with 'struct thread' in 'struct sysctl_req', which describes in-progress sysctl requests. This permits sysctl handlers to have access to the current thread, permitting work on implementing td->td_ucred, migration of suser() to using struct thread to derive the appropriate ucred, and allowing struct thread to be passed down to other code, such as network code where td is not currently available (and curproc is used).
o Note: netncp and netsmb are not updated to reflect this change, as they are not currently KSE-adapted.
Reviewed by: julian Obtained from: TrustedBSD Project
|
#
85706 |
|
29-Oct-2001 |
dwmalone |
When scanning for control messages, don't process the data mbufs. This could cause hangs if a unix domain socket was closed with data still to be read from it.
Tested by: Andrea Campi <andrea@webcom.it>
|
#
84736 |
|
09-Oct-2001 |
rwatson |
- Combine kern.ps_showallprocs and kern.ipc.showallsockets into a single kern.security.seeotheruids_permitted, described as: "Unprivileged processes may see subjects/objects with different real uid". NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX domain socket credential instead of comparing root vnodes for the UDS and the process. This allows multiple jails to share the same chroot() and not see each other's UNIX domain sockets.
- Remove unused socheckproc().
Now that cr_cansee() is used universally for socket visibility, a variety of policies are more consistently enforced, including uid-based restrictions and jail-based restrictions. This also better-supports the introduction of additional MAC models.
Reviewed by: ps, billf Obtained from: TrustedBSD Project
|
#
84527 |
|
05-Oct-2001 |
ps |
Only allow users to see their own socket connections if kern.ipc.showallsockets is set to 0.
Submitted by: billf (with modifications by me) Inspired by: Dave McKay (aka pm aka Packet Magnet) Reviewed by: peter MFC after: 2 weeks
|
#
84472 |
|
04-Oct-2001 |
dwmalone |
Hopefully improve control message passing over Unix domain sockets.
1) Allow the sending of more than one control message at a time over a unix domain socket. This should cover the PR 29499.
2) This requires that unp_{ex,in}ternalize and unp_scan understand mbufs with more than one control message at a time.
3) Internalize and externalize used to work on the mbuf in-place. This made life quite complicated and the code for sizeof(int) < sizeof(file *) could end up doing the wrong thing. The patch always creates a new mbuf/cluster now. This resulted in the change of the prototype for the domain externalise function.
4) You can now send SCM_TIMESTAMP messages.
5) Always use CMSG_DATA(cm) to determine the start where the data in unp_{ex,in}ternalize. It was using ((struct cmsghdr *)cm + 1) in some places, which gives the wrong alignment on the alpha. (NetBSD made this fix some time ago).
This results in an ABI change for descriptor passing and creds passing on the alpha. (Probably on the IA64 and Sparc ports too).
6) Fix userland programs to use CMSG_* macros too.
7) Be more careful about freeing mbufs containing (file *)s. This is made possible by the prototype change of externalise.
PR: 29499 MFC after: 6 weeks
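As an illustration of points 5) and 6), here is a minimal userland sketch of descriptor passing that relies only on the CMSG_* macros. The helper names `send_fd()` and `recv_fd()` are hypothetical and do not come from the commit; the sketch assumes a connected AF_UNIX stream socket and passes exactly one descriptor.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
	struct cmsghdr hdr;			/* forces cmsghdr alignment */
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one descriptor; returns 0 or -1. */
int
send_fd(int sock, int fd)
{
	union fd_ctl ctl;
	char byte = 'x';			/* at least one data byte must travel */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cm;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	/* CMSG_DATA() gives the correctly aligned payload start. */
	memcpy(CMSG_DATA(cm), &fd, sizeof(int));

	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Hypothetical helper: receive one descriptor; returns it or -1. */
int
recv_fd(int sock)
{
	union fd_ctl ctl;
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cm;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return (-1);
	cm = CMSG_FIRSTHDR(&msg);
	if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
	    cm->cmsg_type != SCM_RIGHTS ||
	    cm->cmsg_len != CMSG_LEN(sizeof(int)))
		return (-1);
	memcpy(&fd, CMSG_DATA(cm), sizeof(int));
	return (fd);
}
```

Using `CMSG_SPACE()`, `CMSG_LEN()` and `CMSG_DATA()` rather than pointer arithmetic on `struct cmsghdr` keeps the code independent of the platform's alignment rules.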
|
#
83366 |
|
12-Sep-2001 |
julian |
KSE Milestone 2. Note: ALL MODULES MUST BE RECOMPILED. Make the kernel aware that there are smaller units of scheduling than the process (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
|
#
81907 |
|
19-Aug-2001 |
julian |
Forgot to remove this un-needed test. (M_WAITOK won't fail) I vaguely remember someone once proving it COULD return NULL.. was that changed?
Reminded by: BDE
MFC after: 2 weeks
|
#
81892 |
|
18-Aug-2001 |
julian |
fix typo
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
|
#
81875 |
|
18-Aug-2001 |
julian |
Don't allocate a 400 byte buffer on the stack, nor 800 bytes of structures.
MFC after: 2 weeks
|
#
81857 |
|
17-Aug-2001 |
dd |
Implement a LOCAL_PEERCRED socket option which returns a `struct xucred` with the credentials of the connected peer. Obviously this only works (and makes sense) on SOCK_STREAM sockets. This works for both the connect(2) and listen(2) callers.
There is precise documentation of the semantics in unix(4).
Reviewed by: dwmalone (eyeballed)
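A minimal sketch of how a server might query the option described above. The helper name `peer_euid()` is hypothetical; the FreeBSD branch assumes `struct xucred` and `XUCRED_VERSION` from `<sys/ucred.h>`, `LOCAL_PEERCRED` from `<sys/un.h>`, and option level 0. On other systems the sketch simply reports ENOTSUP. unix(4) remains the authority on the exact semantics.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#if defined(__FreeBSD__)
#include <sys/ucred.h>
#endif

/*
 * Hypothetical helper: store the peer's effective uid in *uid.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
peer_euid(int sock, uid_t *uid)
{
#if defined(__FreeBSD__) && defined(LOCAL_PEERCRED)
	struct xucred cred;
	socklen_t len = sizeof(cred);

	memset(&cred, 0, sizeof(cred));
	if (getsockopt(sock, 0, LOCAL_PEERCRED, &cred, &len) == -1)
		return (-1);
	/* Reject a structure layout this code was not built against. */
	if (cred.cr_version != XUCRED_VERSION) {
		errno = EINVAL;
		return (-1);
	}
	*uid = cred.cr_uid;
	return (0);
#else
	/* Assumption: no LOCAL_PEERCRED equivalent is wired up here. */
	(void)sock;
	(void)uid;
	errno = ENOTSUP;
	return (-1);
#endif
}
```

A typical caller would invoke this on a descriptor returned by accept(2) or on a socket that has completed connect(2), and refuse service if it fails.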
|
#
77183 |
|
25-May-2001 |
rwatson |
o Merge contents of struct pcred into struct ucred. Specifically, add the real uid, saved uid, real gid, and saved gid to ucred, as well as the pcred->pc_uidinfo, which was associated with the real uid, only rename it to cr_ruidinfo so as not to conflict with cr_uidinfo, which corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing the original macro that pointed p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred, p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo, cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize cr_ruidinfo there.
o Restructure many credential modification chunks to always crdup while we figure out locking and optimizations; generally speaking, this means moving to a structure like this:
      newcred = crdup(oldcred);
      ...
      p->p_ucred = newcred;
      crfree(oldcred);
  It's not race-free, but better than nothing. There are also races in sys_process.c, all inter-process authorization, fork, exec, and exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid; remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and allocation.
o Clean up ktrcanset() to take into account changes, and move to using suser_xxx() instead of performing a direct uid==0 comparison.
o Improve commenting in various kern_prot.c credential modification calls to better document current behavior. In a couple of places, current behavior is a little questionable and we need to check POSIX.1 to make sure it's "right". More commenting work still remains to be done.
o Update credential management calls, such as crfree(), to take into account new ruidinfo reference.
o Modify or add the following uid and gid helper routines: change_euid() change_egid() change_ruid() change_rgid() change_svuid() change_svgid(). In each case, the call now acts on a credential not a process, and as such no longer requires more complicated process locking/etc. They now assume the caller will do any necessary allocation of an exclusive credential reference. Each is commented to document its reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not processes and pcreds. Note that this authorization, as well as CANSIGIO(), needs to be updated to use the p_cansignal() and p_cansched() centralized authorization routines, as they currently do not take into account some desirable restrictions that are handled by the centralized routines, as well as being inconsistent with other similar authorization instances.
o Update libkvm to take these changes into account.
Obtained from: TrustedBSD Project Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
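The newcred/oldcred arrangement from this commit can be modelled in userland roughly as follows. This is only a sketch under stated assumptions: `struct model_ucred`, `struct model_proc`, `model_crdup()`, `model_crfree()` and `model_seteuid()` are hypothetical stand-ins, not the kernel's types or functions, and no locking is shown (the commit itself notes that the pattern is not race-free).

```c
#include <stdlib.h>

/* Hypothetical, simplified credential with a reference count. */
struct model_ucred {
	int cr_ref;		/* number of holders */
	unsigned cr_uid;	/* effective uid */
	unsigned cr_ruid;	/* real uid */
};

struct model_proc {
	struct model_ucred *p_ucred;
};

/* Copy a credential; the copy starts with a single reference. */
static struct model_ucred *
model_crdup(const struct model_ucred *old)
{
	struct model_ucred *cr = malloc(sizeof(*cr));

	if (cr == NULL)
		return (NULL);
	*cr = *old;
	cr->cr_ref = 1;
	return (cr);
}

/* Drop one reference; free the credential when none remain. */
static void
model_crfree(struct model_ucred *cr)
{
	if (--cr->cr_ref == 0)
		free(cr);
}

/* Change the effective uid without touching other holders' view. */
int
model_seteuid(struct model_proc *p, unsigned euid)
{
	struct model_ucred *oldcred = p->p_ucred;
	struct model_ucred *newcred = model_crdup(oldcred);

	if (newcred == NULL)
		return (-1);
	newcred->cr_uid = euid;		/* modify the private copy only */
	p->p_ucred = newcred;		/* publish the new credential */
	model_crfree(oldcred);		/* release the process's old reference */
	return (0);
}
```

Because the modification happens on a private copy, any other holder of the old credential (a socket or open file, say) keeps seeing the values it was created with.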
|
#
76166 |
|
01-May-2001 |
markm |
Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h from kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
|
#
75917 |
|
24-Apr-2001 |
tmm |
Change uipc_sockaddr so that a sockaddr_un without a path is returned in nam for an unbound socket instead of leaving nam untouched in that case. This way, the getsockname() output can be used to determine the address family of such sockets (AF_LOCAL).
Reviewed by: iedowse Approved by: rwatson
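From userland, the effect of this change can be observed with a small sketch like the one below. The helper name `socket_family()` is hypothetical; it only wraps getsockname(2) and returns the reported address family.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: return the address family that getsockname(2)
 * reports for fd, or -1 on error.
 */
int
socket_family(int fd)
{
	struct sockaddr_storage ss;
	socklen_t len = sizeof(ss);

	memset(&ss, 0, sizeof(ss));
	if (getsockname(fd, (struct sockaddr *)&ss, &len) == -1)
		return (-1);
	return (ss.ss_family);
}
```

For a freshly created, unbound `socket(AF_UNIX, SOCK_STREAM, 0)`, the helper is expected to return AF_UNIX (the same value as AF_LOCAL), which is the information the commit makes available.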
|
#
72786 |
|
21-Feb-2001 |
rwatson |
o Move per-process jail pointer (p->pr_prison) to inside of the subject credential structure, ucred (cr->cr_prison).
o Allow jail inheritance to be a function of credential inheritance.
o Abstract prison structure reference counting behind pr_hold() and pr_free(), invoked by the similarly named credential reference management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed, rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the flag in the process flags field itself.
o Eliminate the "const" qualifier from suser/p_can/etc to reflect mutex use.
Notes:
o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is required to protect the reference count plus some fields in the structure.
Reviewed by: freebsd-arch Obtained from: TrustedBSD Project
|
#
70254 |
|
21-Dec-2000 |
bmilekic |
* Rename M_WAIT mbuf subsystem flag to M_TRYWAIT. This is because calls with M_WAIT (now M_TRYWAIT) may not wait forever when nothing is available for allocation, and may end up returning NULL. Hopefully we now communicate more of the right thing to developers and make it very clear that it's necessary to check whether calls with M_(TRY)WAIT also resulted in a failed allocation. M_TRYWAIT basically means "try harder, block if necessary, but don't necessarily wait forever." The time spent blocking is tunable with the kern.ipc.mbuf_wait sysctl. M_WAIT is now deprecated but still defined for the next little while.
* Fix a typo in a comment in mbuf.h
* Fix some code that was actually passing the mbuf subsystem's M_WAIT to malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the value of the M_WAIT flag, this could have become a big problem.
|
#
67708 |
|
27-Oct-2000 |
phk |
Convert all users of fldoff() to offsetof(). fldoff() is bad because it only takes a struct tag which makes it impossible to use unions, typedefs etc.
Define __offsetof() in <machine/ansi.h>
Define offsetof() in terms of __offsetof() in <stddef.h> and <sys/types.h>
Remove myriad of local offsetof() definitions.
Remove includes of <stddef.h> in kernel code.
NB: Kernel code should *never* include from /usr/include !
Make <sys/queue.h> include <machine/ansi.h> to avoid polluting the API.
Deprecate <struct.h> with a warning. The warning turns into an error on 01-12-2000 and the file gets removed entirely on 01-01-2001.
Partial reviews by: various. Significant brucifications by: bde
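A short sketch of why offsetof() is the more general tool: it accepts any type name, including a typedef with no struct tag, and can reach into an embedded union. The type `record_t` and the two helper functions are hypothetical examples, not code from the tree.

```c
#include <stddef.h>

/* Hypothetical record: a typedef without a struct tag, holding a union. */
typedef struct {
	char tag;
	union {
		long l;
		double d;
	} u;
	int tail;
} record_t;

/* Offset of a plain member, taken through the typedef name. */
size_t
record_tail_offset(void)
{
	return (offsetof(record_t, tail));
}

/* Offset of a member inside the embedded union. */
size_t
record_union_offset(void)
{
	return (offsetof(record_t, u.d));
}
```

A macro that only takes a struct tag could not express either call, since `record_t` has no tag at all.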
|
#
65495 |
|
05-Sep-2000 |
truckman |
Remove uidinfo hash table lookup and maintenance out of chgproccnt() and chgsbsize(), which are called rather frequently and may be called from an interrupt context in the case of chgsbsize(). Instead, do the hash table lookup and maintenance when credentials are changed, which is a lot less frequent. Add pointers to the uidinfo structures to the ucred and pcred structures for fast access. Pass a pointer to the credential to chgproccnt() and chgsbsize() instead of passing the uid. Add a reference count to the uidinfo structure and use it to decide when to free the structure rather than freeing the structure when the resource consumption drops to zero. Move the resource tracking code from kern_proc.c to kern_resource.c. Move some duplicate code sequences in kern_prot.c to separate helper functions. Change KASSERTs in this code to unconditional tests and calls to panic().
|
#
65198 |
|
29-Aug-2000 |
green |
Remove any possibility of hiwat-related race conditions by changing the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and a u_long target to set it to. The whole thing is splnet().
This fixes a problem that jdp has been able to provoke.
|
#
62976 |
|
11-Jul-2000 |
mckusick |
Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed.
Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot, so for brevity and timeliness they are not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
|
#
62573 |
|
04-Jul-2000 |
phk |
Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.
Pointed out by: bde
|
#
62454 |
|
03-Jul-2000 |
phk |
Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:
Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grok our sources:
-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
|
#
61976 |
|
22-Jun-2000 |
alfred |
fix races in the uidinfo subsystem, several problems existed:
1) While allocating a uidinfo struct, malloc is called with M_WAITOK; it's possible that while asleep another process by the same user could have woken up earlier and inserted an entry into the uid hash table. Having redundant entries causes inconsistencies that we can't handle.
fix: do a non-waiting malloc, and if that fails then do a blocking malloc, after waking up check that no one else has inserted an entry for us already.
2) Because many checks for sbsize were done as "test then set" in a non atomic manner it was possible to exceed the limits put up via races.
fix: instead of querying the count then setting, we just attempt to set the count and leave it up to the function to return success or failure.
3) The uidinfo code was inlined and repeated; lookups, insertions and deletions needed to be in their own functions for clarity.
Reviewed by: green
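The fix for problem 2) replaces "query, then set" with a single operation that either applies the change or reports failure. Below is a userland analogy of that idea using C11 atomics. It is a sketch only: `struct uid_usage` and `charge_sbsize()` are hypothetical names, and using a compare-and-swap loop is an assumption made for this illustration, not a description of the mechanism the kernel code used.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-uid accounting record. */
struct uid_usage {
	atomic_long sb_bytes;	/* socket buffer bytes currently charged */
};

/*
 * Try to adjust the charged amount by delta. Growth beyond limit is
 * refused. The check and the update form one atomic step, so two
 * concurrent callers cannot both pass the check and overshoot.
 */
bool
charge_sbsize(struct uid_usage *ui, long delta, long limit)
{
	long cur = atomic_load(&ui->sb_bytes);

	for (;;) {
		long next = cur + delta;

		if (delta > 0 && next > limit)
			return (false);
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&ui->sb_bytes, &cur, next))
			return (true);
	}
}
```

The caller no longer reads the counter itself; it asks for the change and acts on the boolean result, which is the shape of interface the commit message describes.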
|
#
57859 |
|
09-Mar-2000 |
shin |
Enable SCM_RIGHTS on alpha. Allocate necessary buffer as conversion between int and struct file *.
Approved by: jkh
Submitted by: brian Reviewed by: bde, brian, peter
|
#
54655 |
|
15-Dec-1999 |
eivind |
Introduce NDFREE (and remove VOP_ABORTOP)
|
#
53212 |
|
16-Nov-1999 |
phk |
This is a partial commit of the patch from PR 14914:
A lot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures.
This batch of changes compile to the same object files.
Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
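For readers unfamiliar with the queue(3) macros, here is a minimal sketch of the style this patch moves to. The `struct item`, `struct item_list` and `sum_items()` names are hypothetical; the LIST_* macros are the ones documented in queue(3).

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element with an embedded queue(3) linkage. */
struct item {
	int value;
	LIST_ENTRY(item) link;
};

LIST_HEAD(item_list, item);

/* Walk the list with the macro instead of touching lh_first/le_next. */
int
sum_items(struct item_list *head)
{
	struct item *it;
	int sum = 0;

	LIST_FOREACH(it, head, link)
		sum += it->value;
	return (sum);
}
```

Code written this way never names the linkage fields directly, so the list implementation can change without touching its users.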
|
#
52128 |
|
11-Oct-1999 |
peter |
Trim unused options (or #ifdef for undoc options).
Submitted by: phk
|
#
52070 |
|
09-Oct-1999 |
green |
Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total usage limit.
|
#
51801 |
|
29-Sep-1999 |
guido |
Do not follow symlinks when binding a unix domain socket.
This fixes the ssh 1.2.27 vulnerability as reported in bugtraq.
|
#
51381 |
|
19-Sep-1999 |
green |
Change so_cred's type to a ucred, not a pcred. This makes more sense, actually. Make a sonewconn3() which takes an extra argument (proc) so new sockets created with sonewconn() from a user's system call get the correct credentials, not just the parent's credentials.
|
#
51357 |
|
17-Sep-1999 |
green |
Get rid of some evil defines (a pair of snd and rcv.)
|
#
50477 |
|
27-Aug-1999 |
peter |
$Id$ -> $FreeBSD$
|
#
47028 |
|
11-May-1999 |
phk |
Divorce "dev_t" from the "major|minor" bitmap, which is now called udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types: major() umajor() minor() uminor() makedev() umakedev() dev2udev() udev2dev()
For now they're functions, they will become in-line functions after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the major/minor number combination, (ie: a potential device: we may not have the driver nor the device), like in inodes, vattr, cdevsw registration and so on, whereas the dev_t appears where we carry around a reference to an actual device.
In the future the cdevsw and the aliased-from vnode will be hung directly from the dev_t, along with up to two softc pointers for the device driver and a few housekeeping bits. This will essentially replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the wrong type is being used (dev_t vs udev_t), if you see something not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if it makes a difference. If it does, please try to track it down (many hands make light work) or at least try to reproduce it as simply as possible, and describe how to do that.
Without DEVT_FASCIST I believe this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h> files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which means convert BLK's to CHR's at the perimeter of the vnodes and other places where they enter the game (bootdev, mknod, sysctl).
|
#
46919 |
|
10-May-1999 |
truckman |
Fix descriptor leak provoked by KKIS.05051999.003b exploit code.
unp_internalize() takes a reference to the descriptor. If the send fails after unp_internalize(), the control mbuf would be freed, orphaning the reference.
Tested in -CURRENT by: Pierre Beyssac <beyssac@enst.fr>
|
#
46155 |
|
28-Apr-1999 |
phk |
This Implements the mumbled about "Jail" feature.
This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname.
Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no provisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
|
#
45620 |
|
12-Apr-1999 |
eivind |
More consistent with surrounding style. (Hey - it looked great in the diff...)
Prodded by: bde
|
#
45568 |
|
11-Apr-1999 |
eivind |
Staticize.
|
#
44078 |
|
16-Feb-1999 |
dfr |
* Change sysctl from using linker_set to construct its tree using SLISTs. This makes it possible to change the sysctl tree at runtime.
* Change KLD to find and register any sysctl nodes contained in the loaded file and to unregister them when the file is unloaded.
Reviewed by: Archie Cobbs <archie@whistle.com>, Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
|
#
42960 |
|
21-Jan-1999 |
dillon |
The code that reclaims descriptors from in-transit unix domain descriptor-passing messages was calling sorflush() without checking to see if the descriptor was actually a socket. This can cause a crash by exiting programs that use the mechanism under certain circumstances.
|
#
42957 |
|
21-Jan-1999 |
dillon |
This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
|
#
40648 |
|
25-Oct-1998 |
phk |
Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
|
#
37649 |
|
15-Jul-1998 |
bde |
Cast pointers to uintptr_t/intptr_t instead of to u_long/long, respectively. Most of the longs should probably have been u_longs, but this change is just to prevent warnings about casts between pointers and integers of different sizes, not to fix poorly chosen types.
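A tiny sketch of the cast style in question. The function names are hypothetical; `uintptr_t` comes from `<stdint.h>` and is an unsigned integer type able to hold a converted `void *` and give back an equal pointer.

```c
#include <stdint.h>

/* Store a pointer in an integer of matching width. */
uintptr_t
ptr_to_cookie(void *p)
{
	return ((uintptr_t)p);
}

/* Recover the original pointer from the integer. */
void *
cookie_to_ptr(uintptr_t cookie)
{
	return ((void *)cookie);
}
```

Casting through `long` or `u_long` instead only works where those types happen to be as wide as a pointer, which is what triggers the size-mismatch warnings the commit mentions.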
|
#
36079 |
|
15-May-1998 |
wollman |
Convert socket structures to be type-stable and add a version number.
Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for information about socket structures and use it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).
|
#
35823 |
|
07-May-1998 |
msmith |
In the words of the submitter:
--------- Make callers of namei() responsible for releasing references or locks instead of having the underlying filesystems do it. This eliminates redundancy in all terminal filesystems and makes it possible for stacked transport layers such as umapfs or nullfs to operate correctly.
Quality testing was done with testvn, and lat_fs from the lmbench suite.
Some NFS client testing courtesy of Patrik Kudo.
vop_mknod and vop_symlink still release the returned vpp. vop_rename still releases 4 vnode arguments before it returns. These remaining cases will be corrected in the next set of patches. ---------
Submitted by: Michael Hancock <michaelh@cet.co.jp>
|
#
35256 |
|
17-Apr-1998 |
des |
Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.
|
#
33134 |
|
06-Feb-1998 |
eivind |
Back out DIAGNOSTIC changes.
|
#
33108 |
|
04-Feb-1998 |
eivind |
Turn DIAGNOSTIC into a new-style option.
|
#
31365 |
|
23-Nov-1997 |
bde |
Fixed duplicate definitions of M_FILE (one static).
|
#
31016 |
|
07-Nov-1997 |
phk |
Remove a bunch of variables which were unused both in GENERIC and LINT.
Found by: -Wunused
|
#
30354 |
|
12-Oct-1997 |
phk |
Last major round (Unless Bruce thinks of something :-) of malloc changes.
Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them.
A couple of finer points by: bde
|
#
29361 |
|
14-Sep-1997 |
peter |
Various select -> poll changes
|
#
29041 |
|
02-Sep-1997 |
bde |
Removed unused #includes.
|
#
29024 |
|
01-Sep-1997 |
bde |
Added used #include - don't depend on <sys/mbuf.h> including <sys/malloc.h> (unless we only use the bogusly shared M*WAIT flags).
|
#
28270 |
|
16-Aug-1997 |
wollman |
Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to now malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
|
#
25201 |
|
27-Apr-1997 |
wollman |
The long-awaited mega-massive-network-code- cleanup. Part I.
This commit includes the following changes: 1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility glue for them is deleted, and the kernel will panic on boot if any are compiled in.
2) Certain protocol entry points are modified to take a process structure, so that they can easily tell whether or not it is possible to sleep, and also to access credentials.
3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt() call. Protocols should use the process pointer they are now passed.
4) The PF_LOCAL and PF_ROUTE families have been updated to use the new style, as has the `raw' skeleton family.
5) PF_LOCAL sockets now obey the process's umask when creating a socket in the filesystem.
As a result, LINT is now broken. I'm hoping that some enterprising hacker with a bit more time will either make the broken bits work (should be easy for netipx) or dike them out.
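Item 5) can be checked from userland with a sketch like the following. The helper name `bound_socket_mode()` is hypothetical; it binds a new AF_UNIX socket at the given path under a temporary umask and returns the permission bits of the resulting filesystem node. The caller is assumed to supply a path that does not yet exist and to unlink it afterwards.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a socket at path with umask set to mask
 * and return the permission bits of the created node, or -1 on error.
 */
int
bound_socket_mode(const char *path, mode_t mask)
{
	struct sockaddr_un addr;
	struct stat st;
	mode_t old;
	int fd, rc;

	if (strlen(path) >= sizeof(addr.sun_path))
		return (-1);
	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd == -1)
		return (-1);

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);	/* length was checked above */

	old = umask(mask);		/* applies only for the bind() call */
	rc = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	umask(old);

	if (rc == 0)
		rc = stat(path, &st);
	close(fd);
	if (rc == -1)
		return (-1);
	return ((int)(st.st_mode & 0777));
}
```

With a mask of 077, the returned mode should carry no group or other permission bits, which is a common way to keep a local socket private to its owner.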
|
#
24131 |
|
23-Mar-1997 |
bde |
Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.
|
#
24083 |
|
21-Mar-1997 |
wpaul |
Add support to sendmsg()/recvmsg() for passing credentials between processes using AF_LOCAL sockets. This hack is going to be used with Secure RPC to duplicate a feature of STREAMS which has no real counterpart in sockets (with STREAMS/TLI, you can apparently use t_getinfo() to learn UID of a local process on the other side of a transport endpoint).
What happens is this: the client sets up a sendmsg() call with ancillary data using the SCM_CREDS socket-level control message type. It does not need to fill in the structure. When the kernel notices the data, unp_internalize() fills in the cmsgcred structure with the sending process' credentials (UID, EUID, GID, and ancillary groups). This data is later delivered to the receiving process. The receiver can then perform the following tests:
- Did the client send ancillary data?
  o Yes, proceed.
  o No, refuse to authenticate the client.
- Did the client send data of type SCM_CREDS?
  o Yes, proceed.
  o No, refuse to authenticate the client.
- Is the cmsgcred structure the right size?
  o Yes, proceed.
  o No, signal a possible error.
The receiver can now inspect the credential information and use it to authenticate the client.
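The three receiver-side tests can be sketched as a small validation routine. To keep the sketch portable, the expected message type and payload size are parameters: on FreeBSD a caller would presumably pass SCM_CREDS and sizeof(struct cmsgcred). The enum and the function name `check_creds_cmsg()` are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

enum creds_check {
	CREDS_OK = 0,		/* all three tests passed */
	CREDS_MISSING,		/* no ancillary data: refuse */
	CREDS_WRONG_TYPE,	/* not the expected type: refuse */
	CREDS_BAD_SIZE		/* unexpected length: possible error */
};

/*
 * Hypothetical helper: apply the three tests to the first control
 * message of a msghdr that recvmsg(2) has already filled in.
 */
enum creds_check
check_creds_cmsg(struct msghdr *msg, int want_type, size_t want_payload)
{
	struct cmsghdr *cm = CMSG_FIRSTHDR(msg);

	if (cm == NULL)
		return (CREDS_MISSING);
	if (cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != want_type)
		return (CREDS_WRONG_TYPE);
	if (cm->cmsg_len != CMSG_LEN(want_payload))
		return (CREDS_BAD_SIZE);
	return (CREDS_OK);
}
```

Only after CREDS_OK is returned should the receiver read the payload through `CMSG_DATA()` and use it to authenticate the client.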
|
#
23081 |
|
24-Feb-1997 |
wollman |
Create a new branch of the kernel MIB, kern.ipc, to store all of the configurables and instrumentation related to inter-process communication mechanisms. Some variables, like mbuf statistics, are instrumented here for the first time.
For mbuf statistics: also keep track of m_copym() and m_pullup() failures, and provide for the user's inspection the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.
|
#
22975 |
|
22-Feb-1997 |
peter |
Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
|
#
22521 |
|
10-Feb-1997 |
dyson |
This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes.
The system boots and can mount UFS filesystems.
Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed.
Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
|
#
21673 |
|
14-Jan-1997 |
jkh |
Make the long-awaited change from $Id$ to $FreeBSD$
This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
|
#
20167 |
|
05-Dec-1996 |
julian |
Add comments to hard-to-follow File descriptor handling code
|
#
14546 |
|
11-Mar-1996 |
dg |
Move or add #include <queue.h> in preparation for upcoming struct socket changes.
|
#
14496 |
|
11-Mar-1996 |
hsu |
Merge in Lite2: LIST replacement for f_filef, f_fileb, and filehead. Reviewed by: davidg & bde
|
#
12820 |
|
14-Dec-1995 |
phk |
Another mega commit to staticize things.
|
#
10484 |
|
30-Aug-1995 |
dyson |
Increase the size of the pipe buffer as denoted by PIPSIZ from 4k to 8k. This has a significant effect on the pipe performance. In the future it might be good to increase this to 16k. PIPSIZ is now tunable for experimentation.
|
#
10080 |
|
16-Aug-1995 |
bde |
Make everything except the unsupported network sources compile cleanly with -Wnested-externs.
|
#
9996 |
|
08-Aug-1995 |
dg |
Move mbuf frees to after call to sorflush().
Submitted by: Matt Dillon
|
#
8876 |
|
30-May-1995 |
rgrimes |
Remove trailing whitespace.
|
#
8426 |
|
10-May-1995 |
wollman |
Make networking domains drop-ins, through the magic of GNU ld. (Some day, there may even be LKMs.) Also, change the internal name of `unixdomain' to `localdomain' since AF_LOCAL is now the preferred name of this family. Declare netisr correctly and in the right place.
|
#
6436 |
|
15-Feb-1995 |
dg |
Fixed bug caused by attempting a connect with a null 'nam'.
|
#
6222 |
|
07-Feb-1995 |
wollman |
Merge in the socket-level support for Transaction TCP.
|
#
3308 |
|
02-Oct-1994 |
phk |
All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
|
#
3175 |
|
28-Sep-1994 |
phk |
A potential panic, found by adding declarations.
|
#
1817 |
|
02-Aug-1994 |
dg |
Added $Id$
|
#
1549 |
|
25-May-1994 |
rgrimes |
The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.
Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
|
#
1542 |
|
24-May-1994 |
rgrimes |
This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
|
#
1541 |
|
24-May-1994 |
rgrimes |
BSD 4.4 Lite Kernel Sources
|