History log of /freebsd-11-stable/sys/kern/kern_descrip.c
Revision Date Author Comments
(<<< Hide modified files)
(Show modified files >>>)
# 369335 22-Feb-2021 jamie

MFC jail: Change both root and working directories in jail_attach(2)

jail_attach(2) performs an internal chroot operation, leaving it up to
the calling process to assure the working directory is inside the jail.

Add a matching internal chdir operation to the jail's root. Also
ignore kern.chroot_allow_open_directories, and always disallow the
operation if there are any directory descriptors open.

Reported by: mjg
Approved by: markj, kib

(cherry picked from commit d4380c0cdd0517dc038403dd5c99242ce78bdeb5)

Git Hash: 570121808a76b85b2709502fb15618dd1e5296f1
Git Author: jamie@FreeBSD.org


# 358202 21-Feb-2020 kevans

MFC r357899: u_char -> vm_prot_t in a couple of places, NFC

The latter is a typedef of the former; the typedef exists and these bits are
representing vmprot values, so use the correct type.


# 353124 05-Oct-2019 markj

MFC r353010:
Disallow fcntl(F_READAHEAD) when the vnode is not a regular file.


# 349776 06-Jul-2019 markj

MFC r349547:
Use a consistent snapshot of the fd's rights in fget_mmap().


# 333069 27-Apr-2018 jhb

MFC 332657:
Properly do a deep copy of the ioctls capability array for fget_cap().

fget_cap() tries to do a cheaper snapshot of a file descriptor without
holding the file descriptor lock. This snapshot does not do a deep
copy of the ioctls capability array, but instead uses a different
return value to inform the caller to retry the copy with the lock
held. However, filecaps_copy() was returning 1 to indicate that a
retry was required, and fget_cap() was checking for 0 (actually
'!filecaps_copy()'). As a result, fget_cap() did not do a deep copy
of the ioctls array and just reused the original pointer. This cause
multiple file descriptor entries to think they owned the same pointer
and eventually resulted in duplicate frees.

The only code path that I'm aware of that triggers this is to create a
listen socket that has a restricted list of ioctls and then call
accept() which calls fget_cap() with a valid filecaps structure from
getsock_cap().

To fix, change the return value of filecaps_copy() to return true if
it succeeds in copying the caps and false if it fails because the lock
is required. I find this more intuitive than fixing the caller in
this case. While here, change the return type from 'int' to 'bool'.

Finally, make filecaps_copy() more robust in the failure case by not
copying any of the source filecaps structure over. This avoids the
possibility of leaking a pointer into a structure if a similar future
caller doesn't properly handle the return value from filecaps_copy()
at the expense of one more branch.

I also added a test case that panics before this change and now passes.


# 331722 29-Mar-2018 eadler

Revert r330897:

This was intended to be a non-functional change. It wasn't. The commit
message was thus wrong. In addition it broke arm, and merged crypto
related code.

Revert with prejudice.

This revert skips files touched in r316370 since that commit was since
MFCed. This revert also skips files that require $FreeBSD$ property
changes.

Thank you to those who helped me get out of this mess including but not
limited to gonzo, kevans, rgrimes.

Requested by: gjb (re)


# 330897 14-Mar-2018 eadler

Partial merge of the SPDX changes

These changes are incomplete but are making it difficult
to determine what other changes can/should be merged.

No objections from: pfg


# 328298 23-Jan-2018 jhb

MFC 320900,323882,324224,324226,324228,326986,326988,326989,326990,326993,
326994,326995,327004: Various fixes for pathconf(2).

The original change to use vop_stdpathconf() more widely was motivated
by a panic due to recent AIO-related changes. However, bde@ reported
that vop_stdpathconf() contained too many settings that were not
filesystem-independent. The end result of this set of patches is to
fix the AIO-related panic via use of a trimmed-down vop_stdpathconf()
while also adding support for missing pathconf variables in various
filesystems (and removing a few settings incorrectly reported as
supported).

320900:
Consistently use vop_stdpathconf() for default pathconf values.

Update filesystems not currently using vop_stdpathconf() in pathconf
VOPs to use vop_stdpathconf() for any configuration variables that do
not have filesystem-specific values. vop_stdpathconf() is used for
variables that have system-wide settings as well as providing default
values for some values based on system limits. Filesystems can still
explicitly override individual settings.

323882:
Only handle _PC_MAX_CANON, _PC_MAX_INPUT, and _PC_VDISABLE for TTY devices.

Move handling of these three pathconf() variables out of vop_stdpathconf()
and into devfs_pathconf() as TTY devices can only be devfs files. In
addition, only return settings for these three variables for devfs devices
whose device switch has the D_TTY flag set.

324224:
Handle _PC_FILESIZEBITS and _PC_SYMLINK_MAX pathconf() requests in cd9660.

cd9660 only supports symlinks with Rock Ridge extensions, so
_PC_SYMLINK_MAX is conditional on Rock Ridge.

324226:
Return 64 for pathconf(_PC_FILESIZEBITS) on tmpfs.

324228:
Flesh out pathconf() on UDF.

- Return 64 bits for _PC_FILESIZEBITS.
- Handle _PC_SYMLINK_MAX.
- Defer _PC_PATH_MAX to vop_stdpathconf().

326986:
Add a custom VOP_PATHCONF method for fdescfs.

The method handles NAME_MAX and LINK_MAX explicitly. For all other
pathconf variables, the method passes the request down to the underlying
file descriptor. This requires splitting a kern_fpathconf() syscallsubr
routine out of sys_fpathconf(). Also, to avoid lock order reversals with
vnode locks, the fdescfs vnode is unlocked around the call to
kern_fpathconf(), but with the usecount of the vnode bumped.

326988:
Add a custom VOP_PATHCONF method for fuse.

This method handles _PC_FILESIZEBITS, _PC_SYMLINK_MAX, and _PC_NO_TRUNC.
For other values it defers to vop_stdpathconf().

326989:
Support _PC_FILESIZEBITS in msdosfs' VOP_PATHCONF().

326990:
Handle _PC_FILESIZEBITS and _PC_NO_TRUNC for smbfs' VOP_PATHCONF().

326993:
Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf().

Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.

326994:
Handle _PC_FILESIZEBITS and _PC_SYMLINK_MAX for devfs' VOP_PATHCONF().

326995:
Use FUSE_LINK_MAX for LINK_MAX in fuse' VOP_PATHCONF().

Should have included this in r326993.

327004:
Rework pathconf handling for FIFOs.

On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.

To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method. For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.

While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value. In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.

PR: 219851
Sponsored by: Chelsio Communications


# 315396 16-Mar-2017 mjg

MFC r312980:

fd: sprinkle __read_mostly and __exclusive_cache_line


# 315372 16-Mar-2017 mjg

MFC r313260:

fd: switch fget_unlocked to atomic_fcmpse


# 315312 15-Mar-2017 dchagin

MFC r305093 (by mjg@):

fd: add fdeget_locked and use in kern_descrip

MFC r305756 (by oshogbo@):

fd: add fget_cap and fget_cap_locked primitives.
They can be used to obtain capabilities along with a referenced fp.

MFC r306174 (by oshogbo@):

capsicum: propagate rights on accept(2)

Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR: 201052

MFC r306184 (by oshogbo@):

fd: simplify fgetvp_rights by using fget_cap_locked.

MFC r306225 (by mjg@):

fd: fix up fgetvp_rights after r306184

fget_cap_locked returns a referenced file, but the fgetvp_rights does
not need it. Instead, due to the filedesc lock being held, it can
ref the vnode after the file was looked up.

Fix up fget_cap_locked to be consistent with other _locked helpers and not
ref the file.

This plugs a leak introduced in r306184.

MFC r306232 (by oshogbo@):

fd: fix up fget_cap

If the kernel is not compiled with the CAPABILITIES kernel options
fget_unlocked doesn't return the sequence number so fd_modify will
always report modification, in that case we got infinity loop.

MFC r311474 (by glebius@):

Use getsock_cap() instead of fgetsock().

MFC r312079 (by glebius@):

Use getsock_cap() instead of deprecated fgetsock().

MFC r312081 (by glebius@):

Use getsock_cap() instead of deprecated fgetsock().

MFC r312087 (by glebius@):

Remove deprecated fgetsock() and fputsock().

Bump __FreeBSD_version as getsock_cap changed and
fgetsock/fputsock pair removed.


# 314334 27-Feb-2017 kib

MFC kern_mmap(9) and related helpers.

MFC r302514 (by rwatson):
Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2).

MFC r302524 (by rwatson):
When mmap(2) is used with a vnode, capture vnode attributes in the
audit trail.

MFC r313352 (by trasz):
Add kern_vm_mmap2(), kern_vm_mprotect(), kern_vm_msync(), kern_vm_munlock(),
kern_vm_munmap(), and kern_vm_madvise().

MFC r313655:
Change type of the prot parameter for kern_vm_mmap() from vm_prot_t to int.

MFC r313696:
Rework r313352.


# 312716 24-Jan-2017 mjg

MFC r311004:

fd: access openfiles once in falloc_noinstall

This is similar to what's done with nprocs.

Note this is only a band aid.


# 312714 24-Jan-2017 mjg

MFC r310805:

Remove cpu_spinwait after seq_consistent.

It does not add any benefit as the read routine will do it as necessary.


# 310964 31-Dec-2016 mjg

MFC r303921:

sigio: do a lockless check in funsetownlist

There is no need to grab the lock first to see if sigio is used, and it
typically is not.


# 310953 31-Dec-2016 mjg

MFC r309893,r309929:

vfs: add vrefact, to be used when the vnode has to be already active

This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

==

vfs: use vrefact in getcwd and fchdir


# 309384 01-Dec-2016 julian

MFH: r306306

Give the user a clue as to which process hit maxfiles.

Sponsored by: Panzura


# 305516 06-Sep-2016 emaste

MFC r305171: allow kern.proc.nfds sysctl in capability mode


# 302408 07-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


/freebsd-11-stable/MAINTAINERS
/freebsd-11-stable/cddl
/freebsd-11-stable/cddl/contrib/opensolaris
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/dtrace/test/tst/common/print
/freebsd-11-stable/cddl/contrib/opensolaris/cmd/zfs
/freebsd-11-stable/cddl/contrib/opensolaris/lib/libzfs
/freebsd-11-stable/contrib/amd
/freebsd-11-stable/contrib/apr
/freebsd-11-stable/contrib/apr-util
/freebsd-11-stable/contrib/atf
/freebsd-11-stable/contrib/binutils
/freebsd-11-stable/contrib/bmake
/freebsd-11-stable/contrib/byacc
/freebsd-11-stable/contrib/bzip2
/freebsd-11-stable/contrib/com_err
/freebsd-11-stable/contrib/compiler-rt
/freebsd-11-stable/contrib/dialog
/freebsd-11-stable/contrib/dma
/freebsd-11-stable/contrib/dtc
/freebsd-11-stable/contrib/ee
/freebsd-11-stable/contrib/elftoolchain
/freebsd-11-stable/contrib/elftoolchain/ar
/freebsd-11-stable/contrib/elftoolchain/brandelf
/freebsd-11-stable/contrib/elftoolchain/elfdump
/freebsd-11-stable/contrib/expat
/freebsd-11-stable/contrib/file
/freebsd-11-stable/contrib/gcc
/freebsd-11-stable/contrib/gcclibs/libgomp
/freebsd-11-stable/contrib/gdb
/freebsd-11-stable/contrib/gdtoa
/freebsd-11-stable/contrib/groff
/freebsd-11-stable/contrib/ipfilter
/freebsd-11-stable/contrib/ldns
/freebsd-11-stable/contrib/ldns-host
/freebsd-11-stable/contrib/less
/freebsd-11-stable/contrib/libarchive
/freebsd-11-stable/contrib/libarchive/cpio
/freebsd-11-stable/contrib/libarchive/libarchive
/freebsd-11-stable/contrib/libarchive/libarchive_fe
/freebsd-11-stable/contrib/libarchive/tar
/freebsd-11-stable/contrib/libc++
/freebsd-11-stable/contrib/libc-vis
/freebsd-11-stable/contrib/libcxxrt
/freebsd-11-stable/contrib/libexecinfo
/freebsd-11-stable/contrib/libpcap
/freebsd-11-stable/contrib/libstdc++
/freebsd-11-stable/contrib/libucl
/freebsd-11-stable/contrib/libxo
/freebsd-11-stable/contrib/llvm
/freebsd-11-stable/contrib/llvm/projects/libunwind
/freebsd-11-stable/contrib/llvm/tools/clang
/freebsd-11-stable/contrib/llvm/tools/lldb
/freebsd-11-stable/contrib/llvm/tools/llvm-dwarfdump
/freebsd-11-stable/contrib/llvm/tools/llvm-lto
/freebsd-11-stable/contrib/mdocml
/freebsd-11-stable/contrib/mtree
/freebsd-11-stable/contrib/ncurses
/freebsd-11-stable/contrib/netcat
/freebsd-11-stable/contrib/ntp
/freebsd-11-stable/contrib/nvi
/freebsd-11-stable/contrib/one-true-awk
/freebsd-11-stable/contrib/openbsm
/freebsd-11-stable/contrib/openpam
/freebsd-11-stable/contrib/openresolv
/freebsd-11-stable/contrib/pf
/freebsd-11-stable/contrib/sendmail
/freebsd-11-stable/contrib/serf
/freebsd-11-stable/contrib/sqlite3
/freebsd-11-stable/contrib/subversion
/freebsd-11-stable/contrib/tcpdump
/freebsd-11-stable/contrib/tcsh
/freebsd-11-stable/contrib/tnftp
/freebsd-11-stable/contrib/top
/freebsd-11-stable/contrib/top/install-sh
/freebsd-11-stable/contrib/tzcode/stdtime
/freebsd-11-stable/contrib/tzcode/zic
/freebsd-11-stable/contrib/tzdata
/freebsd-11-stable/contrib/unbound
/freebsd-11-stable/contrib/vis
/freebsd-11-stable/contrib/wpa
/freebsd-11-stable/contrib/xz
/freebsd-11-stable/crypto/heimdal
/freebsd-11-stable/crypto/openssh
/freebsd-11-stable/crypto/openssl
/freebsd-11-stable/gnu/lib
/freebsd-11-stable/gnu/usr.bin/binutils
/freebsd-11-stable/gnu/usr.bin/cc/cc_tools
/freebsd-11-stable/gnu/usr.bin/gdb
/freebsd-11-stable/lib/libc/locale/ascii.c
/freebsd-11-stable/sys/cddl/contrib/opensolaris
/freebsd-11-stable/sys/contrib/dev/acpica
/freebsd-11-stable/sys/contrib/ipfilter
/freebsd-11-stable/sys/contrib/libfdt
/freebsd-11-stable/sys/contrib/octeon-sdk
/freebsd-11-stable/sys/contrib/x86emu
/freebsd-11-stable/sys/contrib/xz-embedded
/freebsd-11-stable/usr.sbin/bhyve/atkbdc.h
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.c
/freebsd-11-stable/usr.sbin/bhyve/bhyvegc.h
/freebsd-11-stable/usr.sbin/bhyve/console.c
/freebsd-11-stable/usr.sbin/bhyve/console.h
/freebsd-11-stable/usr.sbin/bhyve/pci_fbuf.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.c
/freebsd-11-stable/usr.sbin/bhyve/pci_xhci.h
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.c
/freebsd-11-stable/usr.sbin/bhyve/ps2kbd.h
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.c
/freebsd-11-stable/usr.sbin/bhyve/ps2mouse.h
/freebsd-11-stable/usr.sbin/bhyve/rfb.c
/freebsd-11-stable/usr.sbin/bhyve/rfb.h
/freebsd-11-stable/usr.sbin/bhyve/sockstream.c
/freebsd-11-stable/usr.sbin/bhyve/sockstream.h
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.c
/freebsd-11-stable/usr.sbin/bhyve/usb_emul.h
/freebsd-11-stable/usr.sbin/bhyve/usb_mouse.c
/freebsd-11-stable/usr.sbin/bhyve/vga.c
/freebsd-11-stable/usr.sbin/bhyve/vga.h
# 300852 27-May-2016 mjg

fd: provide a common exit point for unlock in kern_dup

While here assert dropped filedesc lock on return from closefp.


# 299227 08-May-2016 mjg

fd: assert dropped filedesc lock in fdcloseexec


# 298819 29-Apr-2016 pfg

sys/kern: spelling fixes in comments.

No functional change.


# 297400 29-Mar-2016 glebius

The sendfile(2) allows to send extra data from userspace before the file
data (headers). Historically the size of the headers was not checked
against the socket buffer space. Application could easily overcommit the
socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space. In case when size of headers is bigger that socket space,
KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.

o With this change, the headers copyin is moved down into the cycle, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
up the stack to syscall wrappers. This required a copy and paste of the
code, but in turn this allowed to remove extra stack carried parameter
from fo_sendfile_t, and embrace entire compat code into #ifdef. If in
future we got more fo_sendfile_t function, the copy and paste level would
even reduce.

Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix


# 296572 09-Mar-2016 jhb

Simplify AIO initialization now that it is standard.

- Mark AIO system calls as STD and remove the helpers to dynamically
register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
the pathconf() system call. fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5589


# 290473 06-Nov-2015 mjg

fd: implement kern.proc.nfds sysctl

Intended purpose is to provide an equivalent of OpenBSD's getdtablecount
syscall for the compat library..


# 287540 07-Sep-2015 mjg

fd: make rights a mandatory argument to fgetvp_rights

The only caller already always passes rights.


# 287539 07-Sep-2015 mjg

fd: make the common case in filecaps_copy work lockless

The filedesc lock is only needed if ioctls caps are present, which is a
rare situation. This is a step towards reducing the scope of the filedesc
lock.


# 287442 03-Sep-2015 cem

Detect badly behaved coredump note helpers

Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath. As vnodes may move around during dump, this is racy.

So:

- Detect badly behaved notes in putnote() and pad underfilled notes.

- Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
exercise the NT_PROCSTAT_FILES corruption. It simply picks random
lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

- Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
disable kinfo packing for PROCSTAT_FILES notes. This should avoid
both FILES note corruption and truncation, even if filenames change,
at the cost of about 1 kiB in padding bloat per open fd. Document
the new sysctl in core.5.

- Fix note_procstat_files to self-limit in the 2nd pass. Since
sometimes this will result in a short write, pad up to our advertised
size. This addresses note corruption, at the risk of sometimes
truncating the last several fd info entries.

- Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
zero padding.

With suggestions from: bjk, jhb, kib, wblock
Approved by: markj (mentor)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3548


# 287416 02-Sep-2015 mjg

fd: remove UMA_ZONE_ZINIT argument from Files zone

Originally it was added in order to prevent trashing of objects with
INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE.

This reverts r286921.

Discussed with: kib


# 286921 19-Aug-2015 kib

fget_unlocked() depends on the freed struct file f_count field being
zero. The file_zone if no-free, but r284861 added trashing of the
freed memory. Most visible manifestation of the issue were 'memory
modified after free' panics for the file zone, triggered from
falloc_noinstall().

Add UMA_ZONE_ZINIT flag to turn off trashing. Mjg noted that it makes
sense to not trash freed memory for any non-free zone, which will be
done later.

Reported and tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation


# 286020 29-Jul-2015 ed

Introduce falloc_caps() to create descriptors with capabilties in place.

falloc_noinstall() followed by finstall() allows you to create and
install file descriptors with custom capabilities. Add falloc_caps()
that can do both of these actions in one go.

This will be used by CloudABI to create pipes with custom capabilities.

Reviewed by: mjg


# 285636 16-Jul-2015 mjg

fd: partially deduplicate fdescfree and fdescfree_remapped

This also moves vrele of cdir/rdir/jdir vnodes earlier, which should not
matter.


# 285622 16-Jul-2015 ed

Implement CloudABI's exec() call.

Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute an new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.

Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?

CloudABI solves this problem by replacing traditional string command
line arguments by tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).

Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:

int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
size_t fdslen);

This change introduces a set of fd*_remapped() functions:

- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
fdinstall_remapped().

We then add a function exec_copyin_data_fds() that builds on top these
functions. It copies in the data and constructs a new remapped file
descriptor. This is used by cloudabi_sys_proc_exec().

Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.

Reviewers: kib, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3079


# 285391 11-Jul-2015 mjg

Create a dedicated function for ensuring that cdir and rdir are populated.

Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.

Reviewed by: kib


# 285390 11-Jul-2015 mjg

Move chdir/chroot-related fdp manipulation to kern_descrip.c

Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by: kib


# 285357 10-Jul-2015 mjg

fd: further cleanup of kern_dup

- make mode enum start from 0 so that the assertion covers all cases [1]
- rename prefix _CLOEXEC flag with _FLAG
- postpone fhold on the old file descriptor, which eliminates the need to fdrop
in error cases.
- fixup FDDUP_FCNTL check missed in the previous commit

This removes 'fp == oldfde->fde_file' assertion which had little value. kern_dup
only calls fd-related functions which cannot drop the lock or a whole lot of
races would be introduced.

Noted by: kib [1]


# 285356 10-Jul-2015 mjg

fd: split kern_dup flags argument into actual flags and a mode

Tidy up the code inside to switch on the mode.


# 285323 09-Jul-2015 ed

Add implementations for some of the CloudABI file descriptor system calls.

All of the CloudABI system calls that operate on file descriptors of an
arbitrary type are prefixed with fd_. This change adds wrappers for
most of these system calls around their FreeBSD equivalents.

The dup2() system call present on CloudABI deviates from POSIX, in the
sense that it can only be used to replace existing file descriptor. It
cannot be used to create new ones. The reason for this is that this is
inherently thread-unsafe. Furthermore, there is no need on CloudABI to
use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no
special meaning.

This change exposes the kern_dup() through <sys/syscallsubr.h> and puts
the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag,
FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not
allocated.

Differential Revision: https://reviews.freebsd.org/D3035
Reviewed by: mjg


# 285321 09-Jul-2015 mjg

fd: prepare do_dup for being exported

- rename it to kern_dup.
- prefix flags with FD
- assert that correct flags were passed


# 285269 08-Jul-2015 kib

Handle copyout for the fcntl(F_OGETLK) using oflock structure.
Otherwise, kernel overwrites a word past the destination.

Submitted by: walter@pelissero.de
PR: 196718
MFC after: 1 week


# 285172 05-Jul-2015 mjg

fd: make 'rights' a manadatory argument to fget* functions


# 285134 04-Jul-2015 mjg

fd: de-k&r-ify functions + some whitespace fixes

No functional changes.


# 284443 16-Jun-2015 mjg

fd: make rights a mandatory argument to fget_unlocked


# 284442 16-Jun-2015 mjg

fd: don't unnecessary copy capabilities in _fget


# 284381 14-Jun-2015 mjg

fd: reduce excessive zeroing on fd close

fde_file as NULL is already an indicator of an unused fd. All other
fields are populated when fp is installed.


# 284380 14-Jun-2015 mjg

fd: move out actual fp installation to _finstall

Use it in fd passing functions as the first step towards fd code cleanup.


# 284217 10-Jun-2015 mjg

Fixup the build after r284215.

Submitted by: Ivan Klymenko <fidaj ukr.net> [slighly modified]


# 284215 10-Jun-2015 mjg

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 284212 10-Jun-2015 mjg

fd: remove fdesc_mtx


# 284211 10-Jun-2015 mjg

fd: use atomics to manage fd_refcnt and fd_holcnt

This gets rid of fdesc_mtx.


# 283059 18-May-2015 mjg

fd: fix imbalanced fdp unlock in F_SETLK and F_GETLK

MFC after: 3 days


# 282213 29-Apr-2015 trasz

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 282033 26-Apr-2015 mjg

fd: plug an always overwritten initialization in fdalloc


# 281436 11-Apr-2015 mjg

fd: remove filedesc argument from fdclose

Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.


# 280407 23-Mar-2015 mjg

filedesc: microoptimize fget_unlocked by getting rid of fd < 0 branch

Casting fd to an unsigned type simplifies fd range coparison to mere checking
if the result is bigger than the table.


# 279993 14-Mar-2015 ian

Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl
strings returned to userland include the nulterm byte.

Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)

Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.

PR: 195668


# 278956 18-Feb-2015 mjg

filedesc: obtain a stable copy of credentials in fget_unlocked

This was broken in r278930.

While here tidy up fget_mmap to use fdp from local var instead of obtaining
the same pointer from td.


# 278930 17-Feb-2015 mjg

filedesc: simplify fget_unlocked & friends

Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.

Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.

Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
otained from stable & current state.

Reviewed by: silence on -hackers


# 277496 21-Jan-2015 mjg

filedesc: avoid spurious copying of capabilities in fget_unlocked

We obtain a stable copy and store it in local 'fde' variable. Storing another
copy (based on aforementioned variable) does not serve any purpose.

No functional changes.


# 277495 21-Jan-2015 mjg

filedesc: return 0 from badfo_close

The only potential in-tree consumer (_fdrop) special-cased it and returns 0
0 on its own instead of calling badfo_close.

Remove the special case since it is not needed and very unlikely to encounter
anyway.

No objections from: kib


# 277494 21-Jan-2015 mjg

filedesc: fix whitespace nits in fget and fget_read

No functional changes.


# 277461 20-Jan-2015 mjg

filedesc: plug a test for impossible condition in _fget


# 274970 24-Nov-2014 jhb

Properly initialize the capability rights for vnodes exported to procstat
that aren't for file descriptors (cwd, jdir, tracevp, etc.).

Submitted by: Mikhail <mp@lenta.ru>


# 274904 22-Nov-2014 mjg

filedesc: plug a test for impossible condition in fgetvp_rights


# 274482 13-Nov-2014 mjg

filedesc: fixup fdinit to lock fdp and preapare files conditinally

Not all consumers providing fdp to copy from want files.

Perhaps these functions should be reorganized to better express the outcome.

This fixes up panics after r273895 .

Reported by: markj


# 274476 13-Nov-2014 kib

Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 274403 11-Nov-2014 glebius

Remove SF_KQUEUE code. This code was developed at Netflix, but was not
ever used. It didn't go into stable/10, neither was documented.
It might be useful, but we collectively decided to remove it, rather
leave it abandoned and unmaintained. It is removed in one single
commit, so restoring it should be easy, if anyone wants to reopen
this idea.

Sponsored by: Netflix


# 274167 06-Nov-2014 mjg

Add sysctl kern.proc.cwd

It returns only current working directory of given process which saves a lot of
overhead over kern.proc.filedesc if given proc has a lot of open fds.

Submitted by: Tiwei Bie <btw mail.ustc.edu.cn> (slightly modified)
X-Additional: JuniorJobs project


# 274166 06-Nov-2014 mjg

filedesc: avoid taking fdesc_mtx when not necessary in fddrop

No functional changes.


# 274165 06-Nov-2014 mjg

filedesc: just free old tables without altering the list which is freed anyway

No functional changes.


# 274012 03-Nov-2014 mjg

filedesc: plus sys/kdb.h include which crept in with r274007


# 274007 03-Nov-2014 mjg

filedesc: plug unnecessary fdp NULL checks in fdescfreee and fdcopy

Anything reaching these functions has fd table.


# 274005 03-Nov-2014 mjg

filedesc: create a dedicated zone for struct filedesc0

Currently sizeof(struct filedesc0) is 1096 bytes, which means allocations from
malloc use 2048 bytes.

There is no easy way to shrink the structure <= 1024 an it is likely to grow in
the future.


# 273970 02-Nov-2014 mjg

filedesc: move freeing old tables to fdescfree

They cannot be accessed by anyone and hold count only protects the structure
from being freed.


# 273968 02-Nov-2014 mjg

filedesc: factor out some code out of fdescfree

Previously it had a huge self-contained chunk dedicated to dealing with shared
tables.

No functional changes.


# 273959 02-Nov-2014 mjg

filedesc: tidy up fdcheckstd

No functional changes.


# 273956 02-Nov-2014 mjg

filedesc: lock filedesc lock in fdcloseexec only when needed


# 273901 31-Oct-2014 mjg

filedesc: drop retval argument from do_dup

It was almost always td_retval anyway.

For the one case where it is not, preserve the old value across the call.


# 273897 31-Oct-2014 mjg

filedesc: fix missed comments about fdsetugidsafety

While here just note that both fdsetugidsafety and fdcheckstd take sleepable
locks.


# 273895 31-Oct-2014 mjg

filedesc: make fdinit return with source filedesc locked and new one sized
appropriately

Assert FILEDESC_XLOCK_ASSERT only for already used tables in fdgrowtable.
We don't have to call it with the lock held if we are just creating new
filedesc.

As a side note, strictly speaking processes can have fdtables with
fd_lastfile = -1, but then they cannot enter fdgrowtable. Very first file
descriptor they get will be 0 and the only syscall allowing to choose fd number
requires an active file descriptor. Should this ever change, we can add an 'init'
(or similar) parameter to fdgrowtable.


# 273894 31-Oct-2014 mjg

filedesc: iterate over fd table only once in fdcopy

While here add 'fdused_init' which does not perform unnecessary work.

Drop FILEDESC_LOCK_ASSERT from fdisused and rely on callers to hold
it when appropriate. This function is only used with INVARIANTS.

No functional changes intended.


# 273893 31-Oct-2014 mjg

filedesc: tidy up fdfree

Implement fdefree_last variant and get rid of 'last' parameter.

No functional changes.


# 273878 31-Oct-2014 mjg

filedesc: tidy up fdcopy a little bit

Test for file availability by fde_file != NULL instead of fdisused, this is
consistent with similar checks later.

Drop badfileops check. badfileops don't have DFLAG_PASSABLE set, so it was never
reached in practice.

fdiused is now only used in some KASSERTS, so ifdef it under INVARIANTS.

No functional changes.


# 273845 30-Oct-2014 mjg

filedesc: make sure to force table reload in fget_unlocked when count == 0

This is a fixup to r273843.


# 273843 30-Oct-2014 mjg

filedesc: microoptimize fget_unlocked by retrying obtaining reference count
without restarting whole lookup

Restart is only needed when fp was closed by current process, which is a much
rarer event than ref/deref by some other thread.


# 273842 30-Oct-2014 mjg

filedesc: get rid of atomic_load_acq_int from fget_unlocked

A read barrier was necessary because fd table pointer and table size were
updated separately, opening a window where fget_unlocked could read new size
and old pointer.

This patch puts both these fields into one dedicated structure, pointer to which
is later atomically updated. As such, fget_unlocked only needs data a dependency
barrier which is a noop on all supported architectures.

Reviewed by: kib (previous version)
MFC after: 2 weeks


# 273458 22-Oct-2014 mjg

filedesc assert that table size is at least 3 in fdsetugidsafety

Requested by: kib


# 273441 21-Oct-2014 mjg

filedesc: cleanup setugidsafety a little

Rename it to fdsetugidsafety for consistency with other functions.

There is no need to take filedesc lock if not closing any files.

The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.

While here tidy up is_unsafe.


# 273377 21-Oct-2014 hselasky

Fix multiple incorrect SYSCTL arguments in the kernel:

- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after: 3 days
Sponsored by: Mellanox Technologies


# 273344 20-Oct-2014 mjg

filedesc: plug 2 write-only variables

Reported by: Coverity
CID: 1245745, 1245746


# 273111 14-Oct-2014 mjg

filedesc: plug 2 assignments to M_ZERO-ed pointers in falloc_noinstall

No functional changes.


# 272567 05-Oct-2014 mjg

filedesc: fix up breakage introduced in 272505

Include sequence counter supports incoditionally [1]. This fixes reprted build
problems with e.g. nvidia driver due to missing opt_capsicum.h.

Replace fishy looking sizeof with offsetof. Make fde_seq the last member in
order to simplify calculations.

Suggested by: kib [1]
X-MFC: with 272505


# 272566 05-Oct-2014 kib

On error, sbuf_bcat() returns -1. Some callers returned this -1 to
the upper layers, which interpret it as errno value, which happens to
be ERESTART. The result was spurious restarts of the sysctls in loop,
e.g. kern.proc.proc, instead of returning ENOMEM to caller.

Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers
expecting errno.

In collaboration with: pho
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week


# 272505 04-Oct-2014 mjg

Plug capability races.

fp and appropriate capability lookups were not atomic, which could result in
improper capabilities being checked.

This could result either in protection bypass or in a spurious ENOTCAPABLE.

Make fp + capability check atomic with the help of sequence counters.

Reviewed by: kib
MFC after: 3 weeks


# 272185 26-Sep-2014 mjg

Make do_dup() static and move relevant macros to kern_descrip.c

No functional changes.


# 272132 25-Sep-2014 kib

Fix fcntl(2) compat32 after r270691. The copyin and copyout of the
struct flock are done in the sys_fcntl(), which mean that compat32 used
direct access to userland pointers.

Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which
performs neccessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.

Reported by: jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 271976 22-Sep-2014 jhb

Add a new fo_fill_kinfo fileops method to add type-specific information to
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
for each file and then convert that to a kinfo_ofile structure rather than
keeping a second, different set of code that directly manipulates
type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.

Differential Revision: https://reviews.freebsd.org/D775
Reviewed by: kib, glebius (earlier version)


# 271489 12-Sep-2014 jhb

Fix various issues with invalid file operations:
- Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(),
and invfo_kqfilter() for use by file types that do not support the
respective operations. Home-grown versions of invfo_poll() were
universally broken (they returned an errno value, invfo_poll()
uses poll_no_poll() to return an appropriate event mask). Home-grown
ioctl routines also tended to return an incorrect errno (invfo_ioctl
returns ENOTTY).
- Use the invfo_*() functions instead of local versions for
unsupported file operations.
- Reorder fileops members to match the order in the structure definition
to make it easier to spot missing members.
- Add several missing methods to linuxfileops used by the OFED shim
layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most
of these used invfo_*(), but a dummy fo_stat() implementation was
added.


# 271486 12-Sep-2014 jhb

Simplify vntype_to_kinfo() by returning when the desired value is found
instead of breaking out of the loop and then immediately checking the loop
index so that if it was broken out of the proper value can be returned.

While here, use nitems().


# 271183 05-Sep-2014 mjg

Plug unnecessary fp assignments in kern_fcntl.

No functional changes.


# 270664 26-Aug-2014 glebius

- Remove socket file operations declaration from sys/file.h.
- Make them static in sys_socket.c.
- Provide generic invfo_truncate() instead of soo_truncate().

Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# 270648 26-Aug-2014 mjg

Fix up races with f_seqcount handling.

It was possible that the kernel would overwrite user-supplied hint.

Abuse vnode lock for this purpose.

In collaboration with: kib
MFC after: 1 week


# 269023 23-Jul-2014 mjg

Prepare fget_unlocked for reading fd table only once.

Some capsicum functions accept fdp + fd and lookup fde based on that.
Add variants which accept fde.

Reviewed by: pjd
MFC after: 1 week


# 268507 10-Jul-2014 mjg

Don't zero fd_nfiles during fdp destruction.

Code trying to take a look has to check fd_refcnt and it is 0 by that time.

This is a follow up to r268505, without this the code would leak memory for
tables bigger than the default.

MFC after: 1 week


# 268505 10-Jul-2014 mjg

Avoid relocking filedesc lock when closing fds during fdp destruction.

Don't call bzero nor fdunused from fdfree for such cases. It would do
unnecessary work and complain that the lock is not taken.

MFC after: 1 week


# 268001 28-Jun-2014 mjg

Make fdunshare accept only td parameter.

Proc had to match the thread anyway and 2 parameters were inconsistent
with the rest.

MFC after: 1 week


# 268000 28-Jun-2014 mjg

Make sure to always clear p_fd for process getting rid of its filetable.

Filetable can be shared with other processes. Previous code failed to
clear the pointer for all but the last process getting rid of the table.
This is mostly cosmetics.

Get rid of 'This should happen earlier' comment. Clearing the pointer in
this place is fine as consumers can reliably check for files availability
by inspecting fd_refcnt and vnodes availabity by NULL-checking them.

MFC after: 1 week


# 267760 22-Jun-2014 mjg

Tidy up fd-related functions called by do_execve

o assert in each one that fdp is not shared
o remove unnecessary NULL checks - all userspace processes have fdtables
and kernel processes cannot execve
o remove comments about the danger of fd_ofiles getting reallocated - fdtable
is not shared and fd_ofiles could be only reallocated if new fd was about to be
added, but if that was possible the code would already be buggy as setugidsafety
work could be undone

MFC after: 1 week


# 267755 22-Jun-2014 mjg

Don't take filedesc lock in fdunshare().

We can read refcnt safely and only care if it is equal to 1.

If it could suddenly change from 1 to something bigger the code would be
buggy even in the previous form and transitions from > 1 to 1 are equally racy
and harmless (we copy even though there is no need).

MFC after: 1 week


# 267710 21-Jun-2014 mjg

fd: replace fd_nfiles with fd_lastfile where appropriate

fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.

Few places are left unpatched for now.

MFC after: 1 week


# 267708 21-Jun-2014 mjg

do_dup: plug redundant adjustment of fd_lastfile

By that time it was already set by fdalloc, or was there in the first place
if fd is replaced.

MFC after: 1 week


# 265247 02-May-2014 mjg

Request a non-exiting process in sysctl_kern_proc_{o,}filedesc

This fixes a race with exit1 freeing p_textvp.

Suggested by: kib
MFC after: 1 week


# 264104 04-Apr-2014 mjg

Garbage collect fdavail.

It rarely returns an error and fdallocn handles the failure of fdalloc
just fine.


# 263530 21-Mar-2014 mjg

Mark the following sysctls as MPSAFE:
kern.file
kern.proc.filedesc
kern.proc.ofiledesc

MFC after: 7 days


# 263460 20-Mar-2014 mjg

Take filedesc lock only for reading when allocating new fdtable.

Code populating the table does this already.

MFC after: 1 week


# 263233 16-Mar-2014 rwatson

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# 262328 22-Feb-2014 bdrewery

Fix style of comment blocks.

Reported by: peter
Approved by: bapt (mentor, implicit)
X-MFC with: r262006


# 262309 21-Feb-2014 mjg

Fix a race between kern_proc_{o,}filedesc_out and fdescfree leading
to use-after-free.

fdescfree proceeds to free file pointers once fd_refcnt reaches 0, but
kern_proc_{o,}filedesc_out only checked for hold count.

MFC after: 3 days


# 262006 16-Feb-2014 bdrewery

Fix M_FILEDESC leak in fdgrowtable() introduced in r244510.

fdgrowtable() now only reallocates fd_map when necessary.

This fixes fdgrowtable() to use the same logic as fdescfree() for
when to free the fd_map. The logic in fdescfree() is intended to
not free the initial static allocation, however the fd_map grows
at a slower rate than the table does. The table is intended to hold
20 fd, but its initial map has many more slots than 20. The slot
sizing causes NDSLOTS(20) through NDSLOTS(63) to be 1 which matches
NDSLOTS(20), so fdescfree() was assuming that the fd_map was still
the initial allocation and not freeing it.

This partially reverts r244510 by reintroducing some of the logic
it removed in fdgrowtable().

Reviewed by: mjg
Approved by: bapt (mentor)
MFC after: 2 weeks


# 262005 16-Feb-2014 bdrewery

Remove redundant memcpy of fd_ofiles in fdgrowtable() added in r247602

Discussed with: mjg
Approved by: bapt (mentor)
MFC after: 2 weeks


# 260233 03-Jan-2014 mjg

Plug a memory leak in dup2 when both old and new fd have ioctl caps.

Reviewed by: pjd
MFC after: 3 days


# 260232 03-Jan-2014 mjg

Don't check for fd limits in fdgrowtable_exp.

Callers do that already and additional check races with process
decreasing limits and can result in not growing the table at all, which
is currently not handled.

MFC after: 3 days


# 258788 01-Dec-2013 adrian

Migrate the sendfile_sync structure into a public(ish) API in preparation
for extending and reusing it.

The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous. However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.

So, with that in mind:

* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
it now knows about is the sfs pointer. The guts of the sync
rendezvous (setup, rendezvous/wait, free) is now done in the
syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.

This should be a no-op. It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.

Tested:

* Peter Holm's sendfile stress / regression scripts

Sponsored by: Netflix, Inc.


# 258768 30-Nov-2013 pjd

Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after: 3 days


# 256210 09-Oct-2013 kib

When growing the file descriptor table, new larger memory chunk is
allocated, but the old table is kept around to handle the case of
threads still performing unlocked accesses to it.

Grow the table exponentially instead of increasing its size by
sizeof(long) * 8 chunks when overflowing. This mode significantly
reduces the total memory use for the processes consuming large numbers
of the file descriptors which open them one by one.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)


# 256209 09-Oct-2013 kib

Reduce code duplication, introduce the getmaxfd() helper to calculate
the max filedescriptor index.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)


# 255892 26-Sep-2013 jmg

it must be the last member, not might...

Reviewed by: attilio
Approved by: re (delphij, gjb)


# 255868 25-Sep-2013 attilio

Avoid memory accesses reordering which can result in fget_unlocked()
seeing a stale fd_ofiles table once fd_nfiles is already updated,
resulting in OOB accesses.

Approved by: re (kib)
Sponsored by: EMC / Isilon storage division
Reported and tested by: pho
Reviewed by: benno


# 255240 05-Sep-2013 pjd

Handle cases where capability rights are not provided.

Reported by: kib


# 255219 04-Sep-2013 pjd

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 254356 15-Aug-2013 glebius

Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# 252436 01-Jul-2013 trociny

Plug up the lock lock leakage when exporting to a short buffer.

Reported by: Alexander Leidinger
Submitted by: mjg
MFC after: 1 week


# 252351 28-Jun-2013 mjg

Remove duplicate NULL check in kern_proc_filedesc_out.
No functional changes.

MFC after: 1 week


# 252349 28-Jun-2013 trociny

Rework r252313:

The filedesc lock may not be dropped unconditionally before exporting
fd to sbuf: fd might go away during execution. While it is ok for
DTYPE_VNODE and DTYPE_FIFO because the export is from a vrefed vnode
here, for other types it is unsafe.

Instead, drop the lock in export_fd_to_sb(), after preparing data in
memory and before writing to sbuf.

Spotted by: mjg
Suggested by: kib
Review by: kib
MFC after: 1 week


# 252313 27-Jun-2013 trociny

To avoid LOR, always drop the filedesc lock before exporting fd to sbuf.

Reviewed by: kib
MFC after: 3 days


# 250223 03-May-2013 jhb

Similar to 233760 and 236717, export some more useful info about the
kernel-based POSIX semaphore descriptors to userland via procstat(1) and
fstat(1):
- Change sem file descriptors to track the pathname they are associated
with and add a ksem_info() method to copy the path out to a
caller-supplied buffer.
- Use the fo_stat() method of shared memory objects and ksem_info() to
export the path, mode, and value of a semaphore via struct kinfo_file.
- Add a struct semstat to the libprocstat(3) interface along with a
procstat_get_sem_info() to export the mode and value of a semaphore.
- Teach fstat about semaphores and to display their path, mode, and value.

MFC after: 2 weeks


# 249487 14-Apr-2013 trociny

Re-factor the code to provide kern_proc_filedesc_out(), kern_proc_out(),
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.

The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.

Reviewed by: kib
MFC after: 3 weeks


# 249480 14-Apr-2013 mjg

Add fdallocn function and use it when passing fds over unix socket.

This gets rid of "unp_externalize fdalloc failed" panic.

Reviewed by: pjd
MFC after: 1 week


# 249240 07-Apr-2013 trociny

Use pget(9) to reduce code duplication.

MFC after: 1 week


# 247737 03-Mar-2013 pjd

Use dedicated malloc type for filecaps-related data, so we can detect any
memory leaks easier.


# 247736 03-Mar-2013 pjd

Plug memory leaks in file descriptors passing.


# 247602 01-Mar-2013 pjd

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# 247284 25-Feb-2013 pjd

Style.

Suggested by: kib


# 247283 25-Feb-2013 pjd

After r237012, the fdgrowtable() doesn't drop the filedesc lock anymore,
so update a stale comment.

Reviewed by: kib, keramida


# 246905 17-Feb-2013 pjd

Don't treat pointers as booleans.


# 246763 13-Feb-2013 ian

Make the F_READAHEAD option to fcntl(2) work as documented: a value of zero
now disables read-ahead. It used to effectively restore the system default
readahead hueristic if it had been changed; a negative value now restores
the default.

Reviewed by: kib


# 246171 31-Jan-2013 pjd

Remove label that was accidentally moved during Giant removal from VFS.


# 244510 20-Dec-2012 des

Rewrite fdgrowtable() so common mortals can actually understand what
it does and how, and add comments describing the data structures and
explaining how they are managed.


# 241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 238834 27-Jul-2012 kib

Add F_DUP2FD_CLOEXEC. Apparently Solaris 11 already did this.

Submitted by: Jukka A. Ukkonen <jau iki fi>
PR: standards/169962
MFC after: 1 week


# 238667 21-Jul-2012 kib

(Incomplete) fixes for symbols visibility issues and style in fcntl.h.

Append '__' prefix to the tag of struct oflock, and put it under BSD
namespace. Structure is needed both by libc and kernel, thus cannot be
hidden under #ifdef _KERNEL.

Move a set of non-standard F_* and O_* constants into BSD namespace.
SUSv4 explicitely allows implemenation to pollute F_* and O_* names
after fcntl.h is included, but it costs us nothing to adhere
to the specification if exact POSIX compliance level is requested by
user code.

Change some spaces after #define to tabs.

Noted by and discussed with: bde
MFC after: 1 week


# 238627 19-Jul-2012 kib

Remove line which was accidentally kept in r238614.

Submitted by: pjd
Pointy hat to: kib
MFC after: 1 week


# 238614 19-Jul-2012 kib

Implement F_DUPFD_CLOEXEC command for fcntl(2), specified by SUSv4.

PR: standards/169962
Submitted by: Jukka A. Ukkonen <jau iki fi>
MFC after: 1 week


# 238272 09-Jul-2012 mjg

Follow-up commit to r238220:

Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for
!FWRITE anyway and may as well know about FREAD.

Make _fget code a bit more readable by converting permission checking from if()
to switch(). Assert that correct permission flags are passed.

In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 6 days
X-MFC: with r238220


# 238220 07-Jul-2012 mjg

Unbreak handling of descriptors opened with O_EXEC by fexecve(2).

While here return EBADF for descriptors opened for writing (previously it was ETXTBSY).

Add fgetvp_exec function which performs appropriate checks.

PR: kern/169651
In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 1 week


# 238029 02-Jul-2012 kib

Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by: pho
No objections from: jhb
MFC after: 3 weeks


# 237204 17-Jun-2012 pjd

Don't check for race with close on advisory unlock (there is nothing smart we
can do when such a race occurs). This saves lock/unlock cycle for the filedesc
lock for every advisory unlock operation.

MFC after: 1 month


# 237199 17-Jun-2012 pjd

Extend the comment about checking for a race with close to explain why
it is done and why we don't return an error in such case.

Discussed with: kib
MFC after: 1 month


# 237198 17-Jun-2012 pjd

If VOP_ADVLOCK() call or earlier checks failed don't check for a race with
close, because even if we had a race there is nothing to unlock.

Discussed with: kib
MFC after: 1 month


# 237158 16-Jun-2012 pjd

Revert r237073. 'td' can be NULL here.

MFC after: 1 month


# 237133 15-Jun-2012 pjd

One more attempt to make prototypes formated according to style(9), which
holefully recovers from the "worse than useless" state.

Reported by: bde
MFC after: 1 month


# 237082 14-Jun-2012 pjd

Remove fdtofp() function and use fget_locked(), which works exactly the same.

MFC after: 1 month


# 237080 14-Jun-2012 pjd

Assert that the filedesc lock is being held when the fdunwrap() function
is called.

MFC after: 1 month


# 237077 14-Jun-2012 pjd

Simplify the code by making more use of the fdtofp() function.

MFC after: 1 month


# 237076 14-Jun-2012 pjd

- Assert that the filedesc lock is being held when fdisused() is called.
- Fix white spaces.

MFC after: 1 month


# 237075 14-Jun-2012 pjd

Style fixes and assertions improvements.

MFC after: 1 month


# 237073 14-Jun-2012 pjd

Assert that the filedesc lock is not held when closef() is called.

MFC after: 1 month


# 237070 14-Jun-2012 pjd

Style fixes.

Reported by: bde
MFC after: 1 month


# 237066 14-Jun-2012 pjd

Remove code duplication from fdclosexec(), which was the reason of the bug
fixed in r237065.

MFC after: 1 month


# 237065 14-Jun-2012 pjd

When we are closing capabilities during exec, we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Similar bug was fixed in r236853.

MFC after: 1 month


# 237063 14-Jun-2012 pjd

Style.

MFC after: 1 month


# 237036 13-Jun-2012 pjd

When checking if file descriptor number is valid, explicitely check for 'fd'
being less than 0 instead of using cast-to-unsigned hack.

Today's commit was brought to you by the letters 'B', 'D' and 'E' :)


# 237033 13-Jun-2012 pjd

Allocate descriptor number in dupfdopen() itself instead of depending on
the caller using finstall().
This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes
a race between finstall() and dupfdopen().

MFC after: 1 month


# 237016 13-Jun-2012 pjd

There is only one caller of the dupfdopen() function, so we can simplify
it a bit:
- We can assert that only ENODEV and ENXIO errors are passed instead of
handling other errors.
- The caller always call finstall() for indx descriptor, so we can assume
it is set. Actually the filedesc lock is dropped between finstall() and
dupfdopen(), so there is a window there for another thread to close the
indx descriptor, but it will be closed in next commit.

Reviewed by: mjg
MFC after: 1 month


# 237013 13-Jun-2012 mjg

Remove 'low' argument from fd_last_used().

This function is static and the only caller always passes 0 as low.

While here update note about return values in comment.

Reviewed by: pjd
Approved by: trasz (mentor)
MFC after: 1 month


# 237012 13-Jun-2012 mjg

Re-apply reverted parts of r236935 by pjd with some changes.

If fdalloc() decides to grow fdtable it does it once and at most doubles
the size. This still may be not enough for sufficiently large fd. Use fd
in calculations of new size in order to fix this.

When growing the table, fd is already equal to first free descriptor >= minfd,
also fdgrowtable() no longer drops the filedesc lock. As a result of this there
is no need to retry allocation nor lookup.

Fix description of fd_first_free to note all return values.

In co-operation with: pjd
Approved by: trasz (mentor)
MFC after: 1 month


# 236950 12-Jun-2012 pjd

Revert part of the r236935 for now, until I figure out why it doesn't
work properly.

Reported by: davidxu


# 236935 11-Jun-2012 pjd

fdgrowtable() no longer drops the filedesc lock so it is enough to
retry finding free file descriptor only once after fdgrowtable().

Spotted by: pluknet
MFC after: 1 month


# 236917 11-Jun-2012 pjd

Use consistent way of checking if descriptor number is valid.

MFC after: 1 month


# 236915 11-Jun-2012 pjd

Be consistent with white spaces.

MFC after: 1 month


# 236914 11-Jun-2012 pjd

Remove code duplicated in kern_close() and do_dup() and use closefp() function
introduced a minute ago.

This code duplication was responsible for the bug fixed in r236853.

Discussed with: kib
Tested by: pho
MFC after: 1 month


# 236913 11-Jun-2012 pjd

Introduce closefp() function that we will be able to use to eliminate
code duplication in kern_close() and do_dup().

This is committed separately from the actual removal of the duplicated
code, as the combined diff was very hard to read.

Discussed with: kib
Tested by: pho
MFC after: 1 month


# 236912 11-Jun-2012 pjd

Merge two ifs into one to make the code almost identical to the code in
kern_close().

Discussed with: kib
Tested by: pho
MFC after: 1 month


# 236911 11-Jun-2012 pjd

Move the code around a bit to move two parts of code duplicated from
kern_close() close together.

Discussed with: kib
Tested by: pho
MFC after: 1 month


# 236910 11-Jun-2012 pjd

Now that fdgrowtable() doesn't drop the filedesc lock we don't need to
check if descriptor changed from under us. Replace the check with an
assert.

Discussed with: kib
Tested by: pho
MFC after: 1 month


# 236853 10-Jun-2012 pjd

When we are closing capability during dup2(), we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Discussed with: rwatson
Sponsored by: FreeBSD Foundation
MFC after: 1 month


# 236849 10-Jun-2012 pjd

Merge two ifs into one. Other minor style fixes.

MFC after: 1 month


# 236832 10-Jun-2012 pjd

Simplify fdtofp().

MFC after: 1 month


# 236822 09-Jun-2012 pjd

There is no need to drop the FILEDESC lock around malloc(M_WAITOK) anymore, as
we now use sx lock for filedesc structure protection.

Reviewed by: kib
MFC after: 1 month


# 236821 09-Jun-2012 pjd

Remove now unused variable.

MFC after: 1 month
MFC with: r236820


# 236820 09-Jun-2012 pjd

Make some of the loops more readable.

Reviewed by: tegge
MFC after: 1 month


# 236812 09-Jun-2012 pjd

Correct panic message.

MFC after: 1 month
MFC with: r236731


# 236731 07-Jun-2012 pjd

In fdalloc() f_ofileflags for the newly allocated descriptor has to be 0.
Assert that instead of setting it to 0.

Sponsored by: FreeBSD Foundation
MFC after: 1 month


# 234131 11-Apr-2012 eadler

Return EBADF instead of EMFILE from dup2 when the second argument is
outside the range of valid file descriptors

PR: kern/164970
Submitted by: Peter Jeremy <peterjeremy@acm.org>
Reviewed by: jilles
Approved by: cperciva
MFC after: 1 week


# 233760 01-Apr-2012 jhb

Export some more useful info about shared memory objects to userland
via procstat(1) and fstat(1):
- Change shm file descriptors to track the pathname they are associated
with and add a shm_path() method to copy the path out to a caller-supplied
buffer.
- Use the fo_stat() method of shared memory objects and shm_path() to
export the path, mode, and size of a shared memory object via
struct kinfo_file.
- Add a struct shmstat to the libprocstat(3) interface along with a
procstat_get_shm_info() to export the mode and size of a shared memory
object.
- Change procstat to always print out the path for a given object if it
is valid.
- Teach fstat about shared memory objects and to display their path,
mode, and size.

MFC after: 2 weeks


# 232702 08-Mar-2012 pho

Free up allocated memory used by posix_fadvise(2).


# 227518 14-Nov-2011 obrien

Reformat comment to be more readable in standard Xterm.
(while I'm here, wrap other long lines)


# 227069 04-Nov-2011 jhb

Move the cleanup of f_cdevpriv when the reference count of a devfs
file descriptor drops to zero out of _fdrop() and into devfs_close_f()
as it is only relevant for devfs file descriptors.

Reviewed by: kib
MFC after: 1 week


# 226301 12-Oct-2011 rwatson

Correct a bug in export of capability-related information from the sysctls
supporting procstat -f: properly provide capability rights information to
userspace. The bug resulted from a merge-o during upstreaming (or rather,
a failure to properly merge FreeBSD-side changed downstream).

Spotted by: des, kibab
MFC after: 3 days


# 225617 16-Sep-2011 kmacy

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 224987 18-Aug-2011 jonathan

Add experimental support for process descriptors

A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# 224914 16-Aug-2011 kib

Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)


# 224810 13-Aug-2011 jonathan

Allow Capsicum capabilities to delegate constrained
access to file system subtrees to sandboxed processes.

- Use of absolute paths and '..' are limited in capability mode.
- Use of absolute paths and '..' are limited when looking up relative
to a capability.
- When a name lookup is performed, identify what operation is to be
performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP.

With these constraints, openat() and friends are now safe in capability
mode, and can then be used by code such as the capability-mode runtime
linker.

Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc


# 224778 11-Aug-2011 rwatson

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# 224225 20-Jul-2011 jonathan

Export capability information via sysctls.

When reporting on a capability, flag the fact that it is a capability,
but also unwrap to report all of the usual information about the
underlying file.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# 224056 15-Jul-2011 jonathan

Add implementation for capabilities.

Code to actually implement Capsicum capabilities, including fileops and
kern_capwrap(), which creates a capability to wrap an existing file
descriptor.

We also modify kern_close() and closef() to handle capabilities.

Finally, remove cap_filelist from struct capability, since we don't
actually need it.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc


# 223866 08-Jul-2011 jonathan

Fix the "passability" test in fdcopy().

Rather than checking to see if a descriptor is a kqueue, check to see if
its fileops flags include DFLAG_PASSABLE.

At the moment, these two tests are equivalent, but this will change with
the addition of capabilities that wrap kqueues but are themselves of type
DTYPE_CAPABILITY. We already have the DFLAG_PASSABLE abstraction, so let's
use it.

This change has been tested with [the newly improved] tools/regression/kqueue.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc


# 223825 06-Jul-2011 trasz

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# 223785 05-Jul-2011 jonathan

Rework _fget to accept capability parameters.

This new version of _fget() requires new parameters:
- cap_rights_t needrights
the rights that we expect the capability's rights mask to include
(e.g. CAP_READ if we are going to read from the file)

- cap_rights_t *haverights
used to return the capability's rights mask (ignored if NULL)

- u_char *maxprotp
the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted
(only used if we are going to mmap the file; ignored if NULL)

- int fget_flags
FGET_GETCAP if we want to return the capability itself, rather than
the underlying object which it wraps

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc


# 223694 30-Jun-2011 jonathan

When Capsicum starts creating capabilities to wrap existing file
descriptors, we will want to allocate a new descriptor without installing
it in the FD array.

Split falloc() into falloc_noinstall() and finstall(), and rewrite
falloc() to call them with appropriate atomicity.

Approved by: mentor (rwatson), re (bz)


# 221808 12-May-2011 stas

- Do no try to drop a NULL filedesc pointer.


# 221807 12-May-2011 stas

- Commit work from libprocstat project. These patches add support for runtime
file and processes information retrieval from the running kernel via sysctl
in the form of new library, libprocstat. The library also supports KVM backend
for analyzing memory crash dumps. Both procstat(1) and fstat(1) utilities have
been modified to take advantage of the library (as the bonus point the fstat(1)
utility no longer need superuser privileges to operate), and the procstat(1)
utility is now able to display information from memory dumps as well.

The newly introduced fuser(1) utility also uses this library and able to operate
via sysctl and kvm backends.

The library is by no means complete (e.g. KVM backend is missing vnode name
resolution routines, and there're no manpages for the library itself) so I
plan to improve it further. I'm commiting it so it will get wider exposure
and review.

We won't be able to MFC this work as it relies on changes in HEAD, which
was introduced some time ago, that break kernel ABI. OTOH we may be able
to merge the library with KVM backend if we really need it there.

Discussed with: rwatson


# 220400 06-Apr-2011 trasz

Add RACCT_NOFILE accounting.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 220245 01-Apr-2011 kib

After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by: jhb
X-MFC-note: do not


# 219999 25-Mar-2011 kib

Add O_CLOEXEC flag to open(2) and fhopen(2).
The new function fallocf(9), that is renamed falloc(9) with added
flag argument, is provided to facilitate the merge to stable branch.

Reviewed by: jhb
MFC after: 1 week


# 219968 24-Mar-2011 jhb

Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after: 1 week


# 218757 16-Feb-2011 bz

Mfp4 CH=177274,177280,177284-177285,177297,177324-177325

VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.

While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.

The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec

Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks


# 218019 28-Jan-2011 jilles

Do not trip a KASSERT if /dev/null cannot be opened for a setuid program.

The fdcheckstd() function makes sure fds 0, 1 and 2 are open by opening
/dev/null. If this fails (e.g. missing devfs or wrong permissions),
fdcheckstd() will return failure and the process will exit as if it received
SIGABRT. The KASSERT is only to check that kern_open() returns the expected
fd, given that it succeeded.

Tripping the KASSERT is most likely if fd 0 is open but fd 1 or 2 are not.

MFC after: 2 weeks


# 216952 04-Jan-2011 kib

Finish r210923, 210926. Mark some devices as eternal.

MFC after: 2 weeks


# 207116 23-Apr-2010 bz

Remove one zero from the double-0.
This code doesn't have a license to kill.

MFC after: 3 days


# 199617 20-Nov-2009 kib

On the return path from F_RDAHEAD and F_READAHEAD fcntls, do not
unlock Giant twice.

While there, bring conditions in the do/while loops closer to style,
that also makes the lines fit into 80 columns.

Reported and tested by: dougb


# 197579 28-Sep-2009 delphij

Add two new fcntls to enable/disable read-ahead:

- F_READAHEAD: specify the amount for sequential access. The amount is
specified in bytes and is rounded up to nearest block size.
- F_RDAHEAD: Darwin compatible version that use 128KB as the sequential
access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrainted by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.

Submitted by: Igor Sysoev <is rambler-co ru>
Reviewed by: kib@
MFC after: 1 month


# 195104 27-Jun-2009 rwatson

Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# 194881 24-Jun-2009 lulf

- Similar to the previous commit, but for CURRENT: Fix a bug where a FIFO vnode
use count was increased twice, but only decreased once.


# 194879 24-Jun-2009 lulf

- Fix a bug where a FIFO vnode use count was increased twice, but only
decreased once.

MFC after: 1 week


# 194262 15-Jun-2009 jhb

Add a new 'void closefrom(int lowfd)' system call. When called, it closes
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.

Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.

Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks


# 193301 02-Jun-2009 jeff

- Use an acquire barrier to increment f_count in fget_unlocked and
remove the volatile cast. Describe the reason in detail in a comment.

Discussed with: bde, jhb


# 192895 27-May-2009 jamie

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 192461 20-May-2009 jhb

Set the umask in a new file descriptor table earlier in fdcopy() to remove
two lock operations.


# 192144 15-May-2009 kib

Revert r192094. The revision caused problems for sysctl(3) consumers
that expect that oldlen is filled with required buffer length even when
supplied buffer is too short and returned error is ENOMEM.

Redo the fix for kern.proc.filedesc, by reverting the req->oldidx when
remaining buffer space is too short for the current kinfo_file structure.
Also, only ignore ENOMEM. We have to convert ENOMEM to no error condition
to keep existing interface for the sysctl, though.

Reported by: ed, Florian Smeets <flo kasimir com>
Tested by: pho


# 192080 14-May-2009 jeff

- Implement a lockless file descriptor lookup algorithm in
fget_unlocked().
- Save old file descriptor tables created on expansion until
the entire descriptor table is freed so that pointers may be
followed without regard for expanders.
- Mark the file zone as NOFREE so we may attempt to reference
potentially freed files.
- Convert several fget_locked() users to fget_unlocked(). This
requires us to manage reference counts explicitly but reduces
locking overhead in the common case.


# 191113 15-Apr-2009 jhb

Update comment above _fget() for earlier change to FWRITE failures return
EBADF rather than EINVAL.

Submitted by: Jaakko Heinonen jh saunalahti fi
MFC after: 1 month


# 188616 14-Feb-2009 marcus

Remove the printf's when the vnode to be exported for procstat is not a VDIR.
If the file system backing a process' cwd is removed, and procstat -f PID
is called, then these messages would have been printed. The extra verbosity is
not required in this situation.

Requested by: kib
Approved by: kib


# 188613 14-Feb-2009 marcus

Change two KASSERTS to printfs and simple returns. Stress testing has
revealed that a process' current working directory can be VBAD if the
directory is removed. This can trigger a panic when procstat -f PID is
run.

Tested by: pho
Discovered by: phobot
Reviewed by: kib
Approved by: kib


# 188485 11-Feb-2009 rwatson

Modify fdcopy() so that, during fork(2), it won't copy file descriptors
from the parent to the child process if they have an operation vector
of &badfileops. This narrows a set of races involving system calls that
allocate a new file descriptor, potentially block for some extended
period, and then return the file descriptor, when invoked by a threaded
program that concurrently invokes fork(2). Similar approches are used
in both Solaris and Linux, and the wideness of this race was introduced
in FreeBSD when we moved to a more optimistic implementation of
accept(2) in order to simplify locking.

A small race necessarily remains because the fork(2) might occur after
the finit() in accept(2) but before the system call has returned, but
that appears unavoidable using current APIs. However, this race is
vastly narrower.

The fix can be validated using the newfileops_on_fork regression test.

PR: kern/130348
Reported by: Ivan Shcheklein <shcheklein at gmail dot com>
Reviewed by: jhb, kib
MFC after: 1 week


# 186601 30-Dec-2008 kib

Clear the pointers to the file in the struct filedesc before file is closed
in fdfree. Otherwise, sysctl_kern_proc_filedesc may dereference stale
struct file * values.

Reported and tested by: pho
MFC after: 1 month


# 185548 02-Dec-2008 peter

Merge user/peter/kinfo branch as of r185547 into head.

This changes struct kinfo_filedesc and kinfo_vmentry such that they are
same on both 32 and 64 bit platforms like i386/amd64 and won't require
sysctl wrapping.

Two new OIDs are assigned. The old ones are available under
COMPAT_FREEBSD7 - but it isn't that simple. The superceded interface
was never actually released on 7.x.

The other main change is to pack the data passed to userland via the
sysctl. kf_structsize and kve_structsize are reduced for the copyout.
If you have a process with 100,000+ sockets open, the unpacked records
require a 132MB+ copyout. With packing, it is "only" ~35MB. (Still
seriously unpleasant, but not quite as devastating). A similar problem
exists for the vmentry structure - have lots and lots of shared libraries
and small mmaps and its copyout gets expensive too.

My immediate problem is valgrind. It traditionally achieves this
functionality by parsing procfs output, in a packed format. Secondly, when
tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled
32 bit binary which ran directly into the differing data structures in 32
vs 64 bit mode. (valgrind uses this to track file descriptor operations
and this therefore affected every single 32 bit binary)

I've added two utility functions to libutil to unpack the structures into
a fixed record length and to make it a little more convenient to use.


# 184652 04-Nov-2008 jhb

Remove unnecessary locking around vn_fullpath(). The vnode lock for the
vnode in question does not need to be held. All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.

In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).

For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.

MFC after: 1 month


# 184600 03-Nov-2008 jhb

Use shared vnode locks instead of exclusive vnode locks for the access(),
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.

Submitted by: ups (mostly)
Tested by: pho
MFC after: 1 month


# 184214 23-Oct-2008 des

Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.


# 184205 23-Oct-2008 des

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# 183808 12-Oct-2008 rwatson

Downgrade XXX to a Note for fgetsock() and fputsock().

MFC after: 3 days


# 181905 20-Aug-2008 ed

Integrate the new MPSAFE TTY layer to the FreeBSD operating system.

The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.

If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).

The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan


# 181465 09-Aug-2008 ed

Remove unneeded D_NEEDGIANT from /dev/fd/{0,1,2}.

There is no reason the fdopen() routine needs Giant. It only sets
curthread->td_dupfd, based on the device unit number of the cdev.

I guess we won't get massive performance improvements here, but still, I
assume we eventually want to get rid of Giant.


# 180059 27-Jun-2008 jhb

Rework the lifetime management of the kernel implementation of POSIX
semaphores. Specifically, semaphores are now represented as new file
descriptor type that is set to close on exec. This removes the need for
all of the manual process reference counting (and fork, exec, and exit
event handlers) as the normal file descriptor operations handle all of
that for us nicely. It is also suggested as one possible implementation
in the spec and at least one other OS (OS X) uses this approach.

Some bugs that were fixed as a result include:
- References to a named semaphore whose name is removed still work after
the sem_unlink() operation. Prior to this patch, if a semaphore's name
was removed, valid handles from sem_open() would get EINVAL errors from
sem_getvalue(), sem_post(), etc. This fixes that.
- Unnamed semaphores created with sem_init() were not cleaned up when a
process exited or exec'd. They were only cleaned up if the process
did an explicit sem_destroy(). This could result in a leak of semaphore
objects that could never be cleaned up.
- On the other hand, if another process guessed the id (kernel pointer to
'struct ksem' of an unnamed semaphore (created via sem_init)) and had
write access to the semaphore based on UID/GID checks, then that other
process could manipulate the semaphore via sem_destroy(), sem_post(),
sem_wait(), etc.
- As part of the permission check (UID/GID), the umask of the proces
creating the semaphore was not honored. Thus if your umask denied group
read/write access but the explicit mode in the sem_init() call allowed
it, the semaphore would be readable/writable by other users in the
same group, for example. This includes access via the previous bug.
- If the module refused to unload because there were active semaphores,
then it might have deregistered one or more of the semaphore system
calls before it noticed that there was a problem. I'm not sure if
this actually happened as the order that modules are discovered by the
kernel linker depends on how the actual .ko file is linked. One can
make the order deterministic by using a single module with a mod_event
handler that explicitly registers syscalls (and deregisters during
unload after any checks). This also fixes a race where even if the
sem_module unloaded first it would have destroyed locks that the
syscalls might be trying to access if they are still executing when
they are unloaded.

XXX: By the way, deregistering system calls doesn't do any blocking
to drain any threads from the calls.
- Some minor fixes to errno values on error. For example, sem_init()
isn't documented to return ENFILE or EMFILE if we run out of semaphores
the way that sem_open() can. Instead, it should return ENOSPC in that
case.

Other changes:
- Kernel semaphores now use a hash table to manage the namespace of
named semaphores nearly in a similar fashion to the POSIX shared memory
object file descriptors. Kernel semaphores can now also have names
longer than 14 chars (up to MAXPATHLEN) and can include subdirectories
in their pathname.
- The UID/GID permission checks for access to a named semaphore are now
done via vaccess() rather than a home-rolled set of checks.
- Now that kernel semaphores have an associated file object, the various
MAC checks for POSIX semaphores accept both a file credential and an
active credential. There is also a new posixsem_check_stat() since it
is possible to fstat() a semaphore file descriptor.
- A small set of regression tests (using the ksem API directly) is present
in src/tools/regression/posixsem.

Reported by: kris (1)
Tested by: kris
Reviewed by: rwatson (lightly)
MFC after: 1 month


# 179385 28-May-2008 ed

Remove redundant checks from fcntl()'s F_DUPFD.

Right now we perform some of the checks inside the fcntl()'s F_DUPFD
operation twice. We first validate the `fd' argument. When finished,
we validate the `arg' argument. These checks are also performed inside
do_dup().

The reason we need to do this, is because fcntl() should return different
errno's when the `arg' argument is out of bounds (EINVAL instead of
EBADF). To prevent the redundant locking of the PROC_LOCK and
FILEDESC_SLOCK, patch do_dup() to support the error semantics required
by fcntl().

Approved by: philip (mentor)


# 179305 25-May-2008 attilio

Replace direct atomic operation for the file refcount witht the
refcount interface.
It also introduces the correct usage of memory barriers, as sometimes
fdrop() and fhold() are used with shared locks, which don't use any
release barrier.


# 179175 21-May-2008 kib

Implement the per-open file data for the cdev.

The patch does not change the cdevsw KBI. Management of the data is
provided by the functions
int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr);
int devfs_get_cdevpriv(void **datap);
void devfs_clear_cdevpriv(void);
All of the functions are supposed to be called from the cdevsw method
contexts.

- devfs_set_cdevpriv assigns the priv as private data for the file
descriptor which is used to initiate currently performed driver
operation. dtr is the function that will be called when either the
last refernce to the file goes away, the device is destroyed or
devfs_clear_cdevpriv is called.
- devfs_get_cdevpriv is the obvious accessor.
- devfs_clear_cdevpriv allows to clear the private data for the still
open file.

Implementation keeps the driver-supplied pointers in the struct
cdev_privdata, that is referenced both from the struct file and struct
cdev, and cannot outlive any of the referee.

Man pages will be provided after the KPI stabilizes.

Reviewed by: jhb
Useful suggestions from: jeff, antoine
Debugging help and tested by: pho
MFC after: 1 month


# 178586 26-Apr-2008 kris

* Correct a mis-merge that leaked the PROC_LOCK [1]
* Return ENOENT on error instead of 0 [2]

Submitted by: rdivacky [1], kib [2]


# 178465 24-Apr-2008 kris

fdhold can return NULL, so add the one remaining missing check for this
condition.

Reviewed by: attilio
MFC after: 1 week


# 177633 26-Mar-2008 dfr

Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks


# 177380 19-Mar-2008 sobomax

Revert previous change - it appears that the limit I was hitting was a
maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsokets), but those would be addressed in a separate commit
if necessary.

Requested by: rwhatson, jeff


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 177232 16-Mar-2008 sobomax

Properly set size of the file_zone to match kern.maxfiles parameter.
Otherwise the parameter is no-op, since zone by default limits number
of descriptors to some 12K entries. Attempt to allocate more ends up
sleeping on zonelimit.

MFC after: 2 weeks


# 176957 08-Mar-2008 antoine

Introduce a new F_DUP2FD command to fcntl(2), for compatibility with
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionnaly equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).

PR: 120233
Submitted by: Jukka Ukkonen
Approved by: rwaston (mentor)
MFC after: 1 month


# 176471 22-Feb-2008 des

This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload
consists of the null-terminated name and the contents of any structure
you wish to record. A new ktrstruct() function constructs and emits a
KTR_STRUCT record. It is accompanied by convenience macros for struct
stat and struct sockaddr.

In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding funtions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.

Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.

Derived from patches by Andrew Li <andrew2.li@citi.com>.

PR: kern/117836
MFC after: 3 weeks


# 176269 14-Feb-2008 simon

Fix sendfile(2) write-only file permission bypass.

Security: FreeBSD-SA-08:03.sendfile
Submitted by: kib


# 176124 09-Feb-2008 marcus

Add support for displaying a process' current working directory, root
directory, and jail directory within procstat. While this functionality
is available already in fstat, encapsulating it in the kern.proc.filedesc
sysctl makes it accessible without using kvm and thus without needing
elevated permissions.

The new procstat output looks like:

PID COMM FD T V FLAGS REF OFFSET PRO NAME
76792 tcsh cwd v d -------- - - - /usr/src
76792 tcsh root v d -------- - - - /
76792 tcsh 15 v c rw------ 16 9130 - -
76792 tcsh 16 v c rw------ 16 9130 - -
76792 tcsh 17 v c rw------ 16 9130 - -
76792 tcsh 18 v c rw------ 16 9130 - -
76792 tcsh 19 v c rw------ 16 9130 - -

I am also bumping __FreeBSD_version for this as this new feature will be
used in at least one port.

Reviewed by: rwatson
Approved by: rwatson


# 175514 20-Jan-2008 rwatson

Export a type for POSIX SHM file descriptors via kern.proc.filedesc as
used by procstat, or SHM descriptors will show up as type unknown in
userspace.


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 09-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 175164 08-Jan-2008 jhb

Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.

Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())


# 175140 07-Jan-2008 jhb

Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by: rwatson


# 175063 02-Jan-2008 jeff

- In sysctl_kern_file skip fdps with negative lastfiles. This can
happen if there are no files open. Accounting for these can
eventually return a negative value for olenp causing sysctl to
crash with a bad malloc.

Reported by: Pawel Worach <pawel.worach@gmail.com>


# 174988 29-Dec-2007 jeff

Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by: kris, pho


# 174167 02-Dec-2007 rwatson

Add two new sysctls in support of the forthcoming procstat(1) to support
its -f and -v arguments:

kern.proc.filedesc - dump file descriptor information for a process, if
debugging is permitted, including socket addresses, open flags, file
offsets, file paths, etc.

kern.proc.vmmap - dump virtual memory mapping information for a process,
if debugging is permitted, including layout and information on
underlying objects, such as the type of object and path.

These provide a superset of the information historically available
through the now-deprecated procfs(4), and are intended to be exported
in an ABI-robust form.


# 171744 06-Aug-2007 rwatson

Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminated
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.

Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)


# 171194 03-Jul-2007 jeff

- Use explicit locking in the various fcntl case statements so that we
can acquire shared filedescriptor locks in the appropriate cases.
- Remove Giant from calls that issue ioctls. The ioctl path has been
mpsafe for some time now.
- Only acquire giant for VOP_ADVLOCK when the filesystem requires giant.
advlock is now mpsafe.

Reviewed by: rwatson
Approved by: re


# 170850 16-Jun-2007 rwatson

Rather than passing SUSER_RUID into priv_check_cred() to specify when
a privilege is checked against the real uid rather than the effective
uid, instead decide which uid to use in priv_check_cred() based on the
privilege passed in. We use the real uid for PRIV_MAXFILES,
PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of
SUSER_RUID; there are now no flags defined for priv_check_cred().

Obtained from: TrustedBSD Project


# 170152 31-May-2007 kib

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


# 169250 04-May-2007 kib

Mark the filedescriptor table entries with VOP_OPEN being performed for them
as UF_OPENING. Disable closing of that entries. This should fix the crashes
caused by devfs_open() (and fifo_open()) dereferencing struct file * by
index, while the filedescriptor is closed by parallel thread.

Idea by: tegge
Reviewed by: tegge (previous version of patch)
Tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks


# 169060 26-Apr-2007 jhb

Avoid a lot of code duplication by using kern_open() to open /dev/null
in fdcheckstd() instead of a stripped down version of kern_open()'s code.

MFC after: 1 week
Reviewed by: cperciva


# 168355 04-Apr-2007 rwatson

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# 167621 15-Mar-2007 jhb

Just use 'fdrop()' instead of 'FILE_LOCK(); fdrop_locked()' in
dupfdopen(). While I'm at it, move the second fdrop() out from under the
filedesc lock.


# 167232 05-Mar-2007 rwatson

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 167211 04-Mar-2007 rwatson

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# 166748 15-Feb-2007 rwatson

Catch up file descriptor printing function in DDB to the addition of kqueues
and POSIX message queues.


# 166747 15-Feb-2007 rwatson

Break file descriptor printing logic out of db_show_files() into
db_print_file(), and add a new "show file <ptr>" DDB command, which can
be used to print out file descriptors referenced in stack traces.


# 166073 17-Jan-2007 delphij

Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.


# 165810 05-Jan-2007 jhb

- Close a race between enumerating UNIX domain socket pcb structures via
sysctl and socket teardown by adding a reference count to the UNIX domain
pcb object and fixing the sysctl that enumerates unpcbs to grab a
reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
file descriptor teardown (fdrop()) by adding a new garbage collection
flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers
in a UNIX domain socket looking for nested file descriptor references
and clears the flag when it is finished. fdrop() checks to see if the
flag is set on a file descriptor whose refcount just dropped to 0 and
waits for unp_gc() to clear the flag before completely destroying the
file descriptor.

MFC after: 1 week
Reviewed by: rwatson
Submitted by: ups
Hopefully makes the panics go away: mx1


# 164033 06-Nov-2006 rwatson

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 162593 24-Sep-2006 jmg

return EBADF instead of successfully attaching (and then panicing) when
an fd is dieing..

Convinced by: jhb
PR: 103127


# 160556 21-Jul-2006 jhb

Add a comment to explain what fdclose() does and what it's purpose is
since the subtlety eluded me when I looked at it last week.


# 160190 08-Jul-2006 jhb

Add a kern_close() so that the ABIs can close a file descriptor w/o having
to populate a close_args struct and change some of the places that do.


# 159975 27-Jun-2006 pjd

Compress direct cr_ruid comparsion and jailed() call to suser_cred(9).

Reviewed by: rwatson


# 157363 01-Apr-2006 rwatson

Mark fgetsock() and fputsock() as depcrecated: callers should rely on
the file descriptor reference, rather than paying additional lock
operations to acquire a socket reference from the file descriptor.
This will also help to ensure that file descriptor based socket
requests are not delivered to a socket after close. Most consumers
have already been converted to this model.

MFC after: 3 months


# 156900 19-Mar-2006 csjp

Restore fd optimization with a few minor tweaks, to quote tegge:

"fdinit() fails to initialize newfdp->fd_fd.fd_lastfile to -1. This breaks
fdcopy() which will incorrectly set newfdp->fd_freefile to 1 if no files are
open and the last file descriptor marked as unused for fdp was 0. This later
causes descriptor 0 to be unavailable in newfdp when the optimization is
enabled.

When the last file descriptor previously marked as used is nonzero and marked
as unused, fdunused() incorrectly sets fdp->fd_lastfile to fd - 1 due to
fd_last_used() returning (size - 1). This hides the problem that breaks the
optimization."

This allows us to keep the optimization, while un-breaking it.

This is a RELENG_6 candidate.

PR: kern/87208
MFC after: 1 week
Submitted by: tegge


# 156861 18-Mar-2006 csjp

Back out fd optimization introduced in revision 1.280 as it appears to be
really breaking things. Simple "close(0); dup(fd)" does not return descriptor
"0" in some cases. Further, this change also breaks some MAC interactions with
mac_execve_will_transition(). Under certain circumstances, fdcheckstd() can
be called in execve(2) causing an assertion that checks to make sure that
stdin, stdout and stderr reside at indexes 0, 1 and 2 in the process fd table
to fail, resulting in a kernel panic when INVARIANTS is on.

This should also kill the "dup(2) regression on 6.x" show stopper item on the
6.1-RELEASE TODO list.

This is a RELENG_6 candidate.

PR: kern/87208
Silence from: des
MFC after: 1 week


# 155362 05-Feb-2006 wsalamon

Add auditing of arguments to the close() and fstat() system calls. Much more
argument auditing yet to come, for remaining system calls in this file.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)


# 154072 06-Jan-2006 jhb

Return EBADF rather than EINVAL for FWRITE failure as per POSIX.

MFC after: 1 week


# 152948 30-Nov-2005 davidxu

Last step to make mq_notify conform to POSIX standard, If the process
has successfully attached a notification request to the message queue
via a queue descriptor, file closing should remove the attachment.


# 152280 10-Nov-2005 rwatson

Add the f_msgcount field to the set of struct file fields printed in show
files.

MFC after: 1 week


# 152275 10-Nov-2005 rwatson

Expanet of details printed for each file descriptor to include it's
garbage collection flags. Reformat generally to make this fit and
leave some room for future expansion.

MFC after: 1 week


# 152272 10-Nov-2005 rwatson

Add a DDB "show files" command to list the current open file list, some
state about each open file, and identify the first process in the process
table that references the file. This is helpful in debugging leaks of
file descriptors.

MFC after: 1 week


# 152252 09-Nov-2005 rwatson

Fix typo in recent comment tweak.

Submitted by: jkim
MFC after: 1 week


# 152250 09-Nov-2005 rwatson

In closef(), remove the assumption that there is a thread associated
with the file descriptor. When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate. Check the thread for NULL to
avoid this scenario. Expand an existing comment to say a bit more about
this.

MFC after: 1 week


# 151932 01-Nov-2005 jhb

Push down Giant into fdfree() and remove it from two of the callers.
Other callers such as some rfork() cases weren't locking Giant anyway.

Reviewed by: csjp
MFC after: 1 week


# 151897 31-Oct-2005 rwatson

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


# 150935 04-Oct-2005 rik

Use FILEDESC_UNLOCK(fdp) after FILE_UNLOCK(p), not before to avoid LOR.
Slightly discussed on current@.

LOR #055

MFC after: 14 days


# 149491 26-Aug-2005 des

Two minor optimizations of fdalloc():

- if minfd < fd_freefile (as is most often the case, since minfd is
usually 0), set it to fd_freefile.

- remove a call to fd_first_free() which duplicates work already done
by fdused().

This change results in a small but measurable speedup for processes
with large numbers (several thousands) of open files.

PR: kern/85176
Submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
MFC after: 3 weeks


# 147592 25-Jun-2005 dd

Fix fdcheckstd to pass the file descriptor along through vn_open. When
opening a device, devfs_open needs the file descriptor to install its
own fileops. Failing to pass the file descriptor causes the vnode to
be returned with the regular vnops, which will cause a panic on the
first read or write because devfs_specops is not meant to support
those operations.

This bug caused a panic after exec'ing any set[ug]id program with
fds 0..2 closed (i.e., if any action had to be taken by fdcheckstd, we
would panic if the exec'd program ever tried to use any of those
descriptors).

Reviewed by: phk
Approved by: re (scottl)


# 145820 03-May-2005 jeff

- Use NAMEI to pickup Giant if we need it in fpcheckstd().


# 143298 08-Mar-2005 keramida

Remove redundant initialization that is repeated in the for() loop
right below it.

Approved by: jhb


# 143269 07-Mar-2005 keramida

Typo & grammar fixes in comments.


# 141636 10-Feb-2005 phk

Make some file/filedesc related functions static


# 141471 07-Feb-2005 jhb

- Tweak kern_msgctl() to return a copy of the requested message queue id
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks and doing non-lock-safe things like
copyout() with the data anyways. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatability system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.


# 140782 24-Jan-2005 phk

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# 140710 24-Jan-2005 jeff

- Use VFS_LOCK_GIANT() in place of mtx_lock(&giant), etc.

Sponsored By: Isilon Systems, Inc.


# 139804 06-Jan-2005 imp

/* -> /*- for copyright notices, minor format tweaks as necessary


# 138867 14-Dec-2004 phk

Fix a deadlock I introduced this morning.

Mostly from: tegge


# 138838 14-Dec-2004 phk

Add a new kind of reference count (fd_holdcnt) to struct filedesc
which holds on to just the data structure and the mutex. (The
existing refcount (fd_refcnt) holds onto the open files in the
descriptor.)

The fd_holdcnt is protected by fdesc_mtx, fd_refcnt by FILEDESC_LOCK.

Add fdhold(struct proc *) which gets a hold on the filedescriptors of
the specified proc..

Add fddrop(struct filedesc *) which drops the fd_holdcnt and if zero
destroys the mutex and frees the memory.

Initialize the fd_holdcnt to one in fdinit(). Normal operations on
the filedesc structure will not change it.

In fdfree() use fddrop() to dispose of the mutex and structure. Hold
the FILEDESC_LOCK() until we have cleaned out the contents and carefully
set the fields to null values during cleanup.

Use fdhold()/fddrop() in mountcheckdirs() and sysctl_kern_file().


# 138836 14-Dec-2004 phk

Make fdesc_mtx private to kern_descrip.c now that the flock has come home.


# 138835 14-Dec-2004 phk

Move the checkdirs() function from vfs_mount.c to kern_descrip.c and
call it mountcheckdirs().


# 138832 14-Dec-2004 phk

Add new function fdunshare() which encapsulates the necessary light magic
for ensuring that a process' filedesc is not shared with anybody.

Use it in the two places which previously had private implmentations.

This collects all fd_refcnt handling in kern_descrip.c


# 138361 03-Dec-2004 phk

Sort and wash #includes.


# 138311 02-Dec-2004 phk

Drop ffree() as a separate function and incorporate the only place used.


# 138310 02-Dec-2004 phk

Style polishing.

Use grepable functions
Other minor nitpickings.


# 138263 01-Dec-2004 phk

We already have a lock initialization function, use that for fdesc_mtx
also.

Polish badfo stuff.


# 138262 01-Dec-2004 phk

Collect the stuff for the /dev/fd/{%d,std{in,out,err}} pseudo-device
driver at the bottom of the file.


# 138261 01-Dec-2004 phk

"nfiles" is a bad name for a global variable. Call it "openfiles" instead
as this is more correct and matches the sysctl variable.


# 138260 01-Dec-2004 phk

Style: move data to top of file.


# 138156 28-Nov-2004 rwatson

Don't acquire Giant before calling closef() in close() (and elsewhere);
instead acquire it conditionally in closef() if it is required for
advisory locking. This removes Giant from the close() path of sockets
and pipes (and any other objects that don't acquire Giant in their
fo_close path, such as kqueues). Giant will still be acquired twice for
vnodes -- once for advisory lock teardown, and a second time in the
fo_close method. Both Poul-Henning and I believe that the advisory lock
teardown code can be moved into the vn_closefile path shortly.

This trims a percent or two off the cost of most non-vnode close
operations on SMP, but has a fairly minimal impact on UP where the cost
of a single mutex operation is pretty low.


# 138104 26-Nov-2004 phk

Fix LOR.

Solution pointed out by: jhb


# 137970 21-Nov-2004 das

Neither of the arguments to closef() can be NULL anymore, so don't
check for that.


# 137769 16-Nov-2004 phk

Move a FILEDESC_UNLOCK up to maintain correct nesting of FILEDESC/FILE
locking.


# 137740 15-Nov-2004 phk

Make FILE_LOCK and FILEDESC_LOCK nest properly by postponing the the
release of FILEDESC_LOCK a few more lines.


# 137686 14-Nov-2004 phk

Move #define up.


# 137647 13-Nov-2004 phk

Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.

Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.


# 137384 08-Nov-2004 phk

Use more intuitive pointer for fdinit() and fdcopy().

Change fdcopy() to take unlocked filedesc.


# 137355 07-Nov-2004 phk

Introduce fdclose() which will clean an entry in a filedesc.

Replace homerolled versions with call to fdclose().

Make fdunused() static to kern_descrip.c


# 137337 07-Nov-2004 phk

Move fdinit() related stuff from .h to .c


# 137331 07-Nov-2004 phk

Allow fdinit() to be called with a NULL fdp argument so we can use
it when setting up init.

Make fdinit() lock the fdp argument as needed.


# 137325 06-Nov-2004 phk

When we open /dev/null for stdin/out/err for safety reasons, do it right:
we should preserve f_data and f_ops if they are already set.


# 136682 18-Oct-2004 rwatson

Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>


# 136109 04-Oct-2004 julian

Another case where we need to guard against a partially
constructed process.

Submitted by: Stephan Uphoff ( ups at tree.com )
MFC after: 3 days


# 134018 19-Aug-2004 rwatson

Remove GIANT_REQUIRED from setugidsafety() as knote_fdclose() no longer
requires Giant.


# 133795 16-Aug-2004 green

Add the missing knote_fdclose().


# 133741 15-Aug-2004 jmg

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 133267 07-Aug-2004 rwatson

We're not yet ready to assert !Giant in kern_fcntl(), as it's called
with Giant from ABI wrappers such as Linux emulation.

Foot shoot off: phk


# 133232 06-Aug-2004 rwatson

Avoid acquiring Giant for some common light-weight or already MPSAFE
fcntl() operations, including:

F_DUPFD dup() alias
F_GETFD retrieve close-on-exec flag
F_SETFD set close-on-exec flag
F_GETFL retrieve file descriptor flags

For the remaining fcntl() operations, do acquire Giant, especially
where we call into fo_ioctl() as a result. We're not yet ready to
push Giant into fo_ioctl(). Once we do, this can all become quite a
bit prettier.


# 133130 04-Aug-2004 rwatson

Assert Giant in the following file descriptor-related functions:

Function Reason
-------- ------
fdfree() VFS
setugidsafety() KQueue
fdcheckstd() VFS
_fgetvp() VFS
fgetsock() Conditional assertion based on debug.mpsafenet


# 132554 22-Jul-2004 rwatson

Push Giant acquisition down into fo_stat() from most callers. Acquire
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op. Don't acquire Giant in fstat() system call.

Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.


# 132549 22-Jul-2004 rwatson

Push acquisition of Giant from fdrop_closed() into fo_close() so that
individual file object implementations can optionally acquire Giant if
they require it:

- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)

Notes:

Giant is still acquired in close() even when closing MPSAFE objects
due to kqueue requiring Giant in the calling closef() code.
Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
of pipe create/destroy pairs from user space with SMP compiled into
the kernel.

The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
test it extensively and so have left Giant over fo_close(). It can
probably be removed given some testing and review.


# 132157 14-Jul-2004 csjp

In addition to the real user ID check, do an explicit jail
check to ensure that the caller is not prison root.

The intention is to fix file descriptor creation so that
prison root can not use the last remaining file descriptors.
This privilege should be reserved for non-jailed root users.

Approved by: bmilekic (mentor)


# 130718 19-Jun-2004 phk

Explicitly initialize f_data and f_vnode to NULL.

Report f_vnode to userland in struct xfile.


# 130585 16-Jun-2004 phk

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 130387 12-Jun-2004 rwatson

Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS


# 130344 11-Jun-2004 phk

Deorbit COMPAT_SUNOS.

We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.


# 129948 01-Jun-2004 rwatson

Push the VOP_ADVLOCK() call to release advisory locks on vnode file
descriptors out of fdrop_locked() and into vn_closefile(). This
removes all knowledge of vnodes from fdrop_locked(), since the lock
behavior was specific to vnodes. This also removes the specific
requirement for Giant in fdrop_locked(), it's now only required by
code that it calls into.

Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.


# 127911 05-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 127540 28-Mar-2004 rwatson

Conditionally assert Giant in fputsock() based on the value of
debug.mpsafenet.


# 126253 25-Feb-2004 truckman

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 126080 21-Feb-2004 phk

Device megapatch 4/6:

Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.

Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.


# 125881 16-Feb-2004 des

Don't bother storing a result when all you need are the side effects.


# 125852 15-Feb-2004 dwmalone

In fdcheckstd the descriptor table should never be shared, so just
KASSERT this rather than trying to deal with what happens when file
descriptors change out from under us.


# 125454 04-Feb-2004 jhb

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 124602 16-Jan-2004 des

Restore correct semantics for F_DUPFD fcntl. This should fix the errors
people have been getting with configure scripts.


# 124598 16-Jan-2004 des

WITNESS won't let us hold two filedesc locks at the same time, so juggle
fdp and newfdp around a bit.


# 124585 16-Jan-2004 des

Remove two KASSERTs which were overly paranoid.


# 124573 15-Jan-2004 des

Take care to drop locks when calling malloc()


# 124548 15-Jan-2004 des

New file descriptor allocation code, derived from similar code introduced
in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed. It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().

Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).


# 124390 11-Jan-2004 des

Mechanical whitespace cleanup.


# 124366 11-Jan-2004 alc

Remove long dead code, specifically, code related to munmapfd().
(See also vm/vm_mmap.c revision 1.173.)


# 123939 28-Dec-2003 dwmalone

Plug a leak of open files that happens when you exec a suid program
with one of std{in,out,err} open. This helps with the file descriptor
leaks reported on -current. This should probably be merged into 5.2.

Reviewed by: ru
Tested by: Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net>


# 121256 19-Oct-2003 dwmalone

falloc allocates a file structure and adds it to the file descriptor
table, acquiring the necessary locks as it works. It usually returns
two references to the new descriptor: one in the descriptor table
and one via a pointer argument.

As falloc releases the FILEDESC lock before returning, there is a
potential for a process to close the reference in the file descriptor
table before falloc's caller gets to use the file. I don't think this
can happen in practice at the moment, because Giant indirectly protects
closes.

To stop the file being completly closed in this situation, this change
makes falloc set the refcount to two when both references are returned.
This makes life easier for several of falloc's callers, because the
first thing they previously did was grab an extra reference on the
file.

Reviewed by: iedowse
Idea run past: jhb


# 120659 02-Oct-2003 rwatson

Remove the global variable 'cmask', which was used to initialize the
fd_cmask field in the file descriptor structure for the first process
indirectly from CMASK, and when an fd structure is initialized before
being filled in, and instead just use CMASK. This appears to be an
artifact left over from the initial integration of quotas into BSD.

Suggested by: peter


# 118448 04-Aug-2003 dwmalone

Do some minor Giant pushdown made possible by copyin, fget, fdrop,
malloc and mbuf allocation all not requiring Giant.

1) ostat, fstat and nfstat don't need Giant until they call fo_stat.
2) accept can copyin the address length without grabbing Giant.
3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit.
4) move Giant grabbing from each indivitual recv* syscall to recvit.


# 118143 29-Jul-2003 alc

Revision 1.51 of vm/uma_core.c modified uma_large_free() to acquire Giant
when needed. So, don't do it here.


# 118126 28-Jul-2003 rwatson

When exporting file descriptor data for threads invoking the
kern.file sysctl, don't return information about processes that
fail p_cansee(td, p). This prevents sockstat and related
programs from seeing file descriptors owned by processes not
in the same jail as the thread, as well as having implications
for MAC, etc.

This is a partial solution: it permits an information leak about
the number of descriptors in the sizing calculation (but this is
not new information, you can also get it from kern.openfiles),
and doesn't attempt to mask file descriptors based on the
properties of the descriptor, only the process referencing it.
However, it provides most of what you want under most
circumstances, without complicating the locking.

PR: 54211
Based on a patch submitted by: Pawel Jakub Dawidek <nick@garage.freebsd.pl>


# 118094 27-Jul-2003 phk

Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.


# 118033 25-Jul-2003 alc

revision 1.51 of vm/uma_core.c modified uma_large_malloc() to acquire
Giant when needed.


# 117494 12-Jul-2003 truckman

Extend the mutex pool implementation to permit the creation and use of
multiple mutex pools with different options and sizes. Mutex pools can
be created with either the default sleep mutexes or with spin mutexes.
A dynamically created mutex pool can now be destroyed if it is no longer
needed.

Create two pools by default, one that matches the existing pool that
uses the MTX_NOWITNESS option that should be used for building higher
level locks, and a new pool with witness checking enabled.

Modify the users of the existing mutex pool to use the appropriate pool
in the new implementation.

Reviewed by: jhb


# 117222 04-Jul-2003 phk

Use the f_vnode field to tell which file descriptors have a vnode.


# 116678 22-Jun-2003 phk

Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.


# 116601 20-Jun-2003 phk

Don't (re)initialize f_gcflag to zero.

Move initialization of DTYPE_VNODE specific field f_seqcount into
the DTYPE_VNODE specific code.


# 116585 19-Jun-2003 alfred

Unlock the struct file lock before aquiring Giant, otherwise
we can deadlock because of lock order reversals. This was not
caught because Witness ignores pool mutexes right now.

Diagnosis and help: truckman
Noticed by: pho


# 116564 19-Jun-2003 silby

Add a rate limited message reporting when kern.maxfiles is exceeded,
reporting who did it.

Also, fix a style bug introduced in the previous change.

MFC after: 1 week


# 116548 18-Jun-2003 silby

Forced commit:

Rev 1.201 also removed the out of file descriptor warning messages
displayed to the console. They were not ratelimited, and only made
a bad situation more annoying.


# 116547 18-Jun-2003 silby

Reserve the last 5% of file descriptors for root use. This should allow
systems to fail more gracefully when a file descriptor exhaustion situation
occurs.

Original patch by: David G. Andersen <dga@lcs.mit.edu>
PR: 45353
MFC after: 1 week


# 116546 18-Jun-2003 phk

Initialize struct fileops with C99 sparse initialization.


# 116182 10-Jun-2003 obrien

Use __FBSDID().


# 115702 02-Jun-2003 tegge

Add tracking of process leaders sharing a file descriptor table and
allow a file descriptor table to be shared between multiple process
leaders.

PR: 50923


# 115540 31-May-2003 phk

Remove needless return

Found by: FlexeLint


# 115042 15-May-2003 rwatson

VOP_PATHCONF() requires a vnode lock; this patch adds locking to
fpathconf(). The lock is held for direct calls to VOP_PATHCONF() in
pathconf() already.

Approved by: re (jhb)
Pointed out by: DEBUG_VFS_LOCKS


# 114293 30-Apr-2003 markm

Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.


# 114216 29-Apr-2003 kan

Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>


# 111815 03-Mar-2003 phk

Gigacommit to improve device-driver source compatibility between
branches:

Initialize struct cdevsw using C99 sparse initializtion and remove
all initializations to default values.

This patch is automatically generated and has been tested by compiling
LINT with all the fields in struct cdevsw in reverse order on alpha,
sparc64 and i386.

Approved by: re(scottl)


# 111708 01-Mar-2003 tegge

Remove unneeded code added in revision 1.188.


# 111412 24-Feb-2003 scottl

Don't NULL out p_fd until after closefd() has been called. This isn't
totally correct, but it has caused breakage for too long. I welcome
someone with more fd fu to fix it correctly.


# 111243 22-Feb-2003 mtm

Remove a comment which hasn't been true since rev. 1.158

Approved by: jhb, markm (mentor)(implicit)


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 110962 15-Feb-2003 tegge

Avoid file lock leakage when linuxthreads port or rfork is used:
- Mark the process leader as having an advisory lock
- Check if process leader is marked as having advisory lock when
closing file
- Check that file is still open after lock has been obtained
- Don't allow file descriptor table sharing between processes
with different leaders

PR: 10265
Reviewed by: alfred


# 110908 15-Feb-2003 alfred

Do not allow kqueues to be passed via unix domain sockets.


# 110906 15-Feb-2003 alfred

Fix LOR with PROC/filedesc. Introduce fdesc_mtx that will be used as a
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.


# 110676 11-Feb-2003 alfred

Don't lock FILEDESC under PROC.

The locking here needs to be revisited, but this ought to get rid of the
LOR messages that people are complaining about for now. I imagine either
I or someone else interested with smp will eventually clear this up.


# 110088 30-Jan-2003 phk

NODEVFS cleanup: remove #ifdefs


# 109653 21-Jan-2003 hsu

Add missing SMP file locks around read-modify-write operations on
the flag field.

Reviewed by: rwatson


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 109526 19-Jan-2003 phk

Originally when DEVFS was added, a global variable "devfs_present"
was used to control code which were conditional on DEVFS' precense
since this avoided the need for large-scale source pollution with
#include "opt_geom.h"

Now that we approach making DEVFS standard, replace these tests
with an #ifdef to facilitate mechanical removal once DEVFS becomes
non-optional.

No functional change by this commit.


# 109153 12-Jan-2003 dillon

Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.


# 109123 11-Jan-2003 dillon

Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.


# 108790 06-Jan-2003 nectar

Correct file descriptor leaks in lseek and do_dup.
The leak in lseek was introduced in vfs_syscalls.c revision 1.218.
The leak in do_dup was introduced in kern_descrip.c revision 1.158.

Submitted by: iedowse


# 108522 31-Dec-2002 alfred

fdcopy() only needs a filedesc pointer.


# 108521 31-Dec-2002 alfred

purge 'register'.


# 108520 31-Dec-2002 alfred

Since fdshare() and fdinit() only operate on filedescs, make them
take pointers to filedesc structures instead of threads. This makes
it more clear that they do not do any voodoo with the thread/proc
or anything other than the filedesc passed in or returned.

Remove some XXX KSE's as this resolves the issue.


# 108519 31-Dec-2002 alfred

fdinit() does not need to lock the filedesc it is creating as no one
besideds itself has access until the function returns.


# 108325 27-Dec-2002 rwatson

Improve consistency between devfs and MAKEDEV: use UID_ROOT and
GID_WHEEL instead of UID_BIN and GID_BIN for /dev/fd/* entries.

Submitted by: kris


# 108255 24-Dec-2002 phk

White-space changes.


# 108238 23-Dec-2002 phk

Detediousficate declaration of fileops array members by introducing
typedefs for them.


# 107819 13-Dec-2002 tjr

Drop filedesc lock and acquire Giant around calls to malloc() and free().
These call uma_large_malloc() and uma_large_free() which require Giant.
Fixes panic when descriptor table is larger than KMEM_ZMAX bytes
noticed by kkenn.

Reviewed by: jhb


# 107272 26-Nov-2002 jhb

If the file descriptors passed into do_dup() are negative, return EBADF
instead of panicing. Also, perform some of the simpler sanity checks on
the fds before acquiring the filedesc lock.

Approved by: re
Reported by: Dan Nelson <dan@emsphone.com> and others


# 106057 27-Oct-2002 wollman

Change the way support for asynchronous I/O is indicated to applications
to conform to 1003.1-2001. Make it possible for applications to actually
tell whether or not asynchronous I/O is supported.

Since FreeBSD's aio implementation works on all descriptor types, don't
call down into file or vnode ops when [f]pathconf() is asked about
_PC_ASYNC_IO; this avoids the need for every file and vnode op to know about
it.


# 105408 18-Oct-2002 jhb

Don't lock the proc lock to clear p_fd. p_fd isn't protected by the proc
lock.


# 105261 16-Oct-2002 jhb

Many style and whitespace fixes.

Submitted by: bde (mostly)


# 105253 16-Oct-2002 jhb

Sort includes a bit.

Submitted by: bde


# 105161 15-Oct-2002 jhb

Argh. Put back setting of P_ADVLOCK for the F_WRLCK case that was
accidentally lost in the previous revision.

Submitted by: bde
Pointy hat to: jhb


# 105140 14-Oct-2002 jhb

Remove the leaderp variable and just access p_leader directly. The
p_leader field is not protected by the proc lock but is only set during
fork1() by the parent process and never changes.


# 104393 03-Oct-2002 truckman

In an SMP environment post-Giant it is no longer safe to blindly
dereference the struct sigio pointer without any locking. Change
fgetown() to take a reference to the pointer instead of a copy of the
pointer and call SIGIO_LOCK() before copying the pointer and
dereferencing it.

Reviewed by: rwatson


# 103368 15-Sep-2002 tmm

fcntl(..., F_SETLKW, ...) takes a pointer to a struct flock just like
F_SETLK does, so it also needs this structure copied in in fnctl() before
calling kern_fcntl().


# 103314 14-Sep-2002 njl

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


# 103293 13-Sep-2002 tmm

Fix fcntl(..., F_GETOWN, ...) and fcntl(..., F_SETOWN, ...) on sparc64
by not passing a pointer to a register_t or intptr_t when the code in
the lower layers expects one to an int.


# 102912 03-Sep-2002 jhb

- Change falloc() to acquire an fd from the process table last so that
it can do it w/o needing to hold the filelist_lock sx lock.
- fdalloc() doesn't need Giant to call free() anymore. It also doesn't
need to drop and reacquire the filedesc lock around free() now as a
result.
- Try to make the code that copies fd tables when extending the fd table in
fdalloc() a bit more readable by performing assignments in separate
statements. This is still a bit ugly though.
- Use max() instead of an if statement so to figure out the starting point
in the search-for-a-free-fd loop in fdalloc() so it reads better next to
the min() in the previous line.
- Don't grow nfiles in steps up to the size needed if we dup2() to some
really large number. Go ahead and double 'nfiles' in a loop prior
to doing the malloc().
- malloc() doesn't need Giant now.
- Use malloc() and free() instead of MALLOC() and FREE() in fdalloc().
- Check to see if the size we are going to grow to is too big, not if the
current size of the fd table is too big in the loop in fdalloc(). This
means if we are out of space or if dup2() requests too high of a fd,
then we will return an error before we go off and try to allocate some
huge table and copy the existing table into it.
- Move all of the logic for dup'ing a file descriptor into do_dup() instead
of putting some of it in do_dup() and duplicating other parts in four
different places. This makes dup(), dup2(), and fcntl(F_DUPFD) basically
wrappers of do_dup now. fcntl() still has an extra check since it uses
a different error return value in one case then the other functions.
- Add a KASSERT() for an assertion that may not always be true where the
fdcheckstd() function assumes that falloc() returns the fd requested and
not some other fd. I think that the assertion is always true because we
are always single-threaded when we get to this point, but if one was
using rfork() and another process sharing the fd table were playing with
the fd table, there might could be a problem.
- To handle the problem of a file descriptor we are dup()'ing being closed
out from under us in dup() in general, do_dup() now obtains a reference
on the file in question before calling fdalloc(). If after the call to
fdalloc() the file for the fd we are dup'ing is a different file, then
we drop our reference on the original file and return EBADF. This
race was only handled in the dup2() case before and would just retry
the operation. The error return allows the user to know they are being
stupid since they have a locking bug in their app instead of dup'ing
some other descriptor and returning it to them.

Tested on: i386, alpha, sparc64


# 102868 02-Sep-2002 iedowse

Split fcntl() into a wrapper and a kernel-callable kern_fcntl()
implementation. The wrapper is responsible for copying additional
structure arguments (struct flock) to and from userland.


# 102412 25-Aug-2002 charnier

Replace various spelling with FALLTHROUGH which is lint()able


# 102003 17-Aug-2002 rwatson

In continuation of early fileop credential changes, modify fo_ioctl() to
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.

- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101983 16-Aug-2002 rwatson

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 101042 31-Jul-2002 des

Have the kern.file sysctl export xfiles rather than files. The truth is
out there!

Sponsored by: DARPA, NAI Labs


# 100831 28-Jul-2002 truckman

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# 100211 17-Jul-2002 jhb

Preallocate a struct file as the first thing in falloc() before we lock
the filelist_lock and check nfiles. This closes a race where we had to
unlock the filedesc to re-lock the filelist_lock.

Reported by: David Xu
Reviewed by: bde (mostly)


# 99009 28-Jun-2002 alfred

More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.


# 97658 31-May-2002 tanimura

Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu


# 96972 20-May-2002 tanimura

Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred


# 96755 16-May-2002 trhodes

More s/file system/filesystem/g


# 96122 06-May-2002 alfred

Make funsetown() take a 'struct sigio **' so that the locking can
be done internally.

Ensure that no one can fsetown() to a dying process/pgrp. We need
to check the process for P_WEXIT to see if it's exiting. Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.

Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.

Seigo Tanimura helped with this.


# 95973 03-May-2002 tanimura

As malloc(9) and free(9) are now Giant-free, remove the Giant lock
across malloc(9) and free(9) of a pgrp or a session.


# 95969 03-May-2002 tanimura

Fix the lock order reversal between the sigio lock and a process/pgrp lock in
funsetownlst() by locking the sigio lock across funsetownlst().


# 95883 01-May-2002 alfred

Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.


# 95711 29-Apr-2002 asmodai

Fix indention which I did wrong in a previous commit.

Submitted by: bde


# 95552 27-Apr-2002 tanimura

Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.


# 95277 22-Apr-2002 alfred

Don't FILEDESC_LOCK around calls to falloc().


# 95123 20-Apr-2002 tanimura

Push down Giant for setpgid(), setsid() and aio_daemon(). Giant protects only
malloc(9) and free(9).


# 95017 18-Apr-2002 nectar

When exec'ing a set[ug]id program, make sure that the stdio file descriptors
(0, 1, 2) are allocated by opening /dev/null for any which are not already
open.

Reviewed by: alfred, phk
MFC after: 2 days


# 94861 16-Apr-2002 jhb

Lock proctree_lock instead of pgrpsess_lock.


# 94586 13-Apr-2002 asmodai

Use the correct macros for F_SETFD/F_GETFD instead of magic numbers.
Reflect that fact in the manual page.

PR: 12723
Submitted by: Peter Jeremy <peter.jeremy@alcatel.com.au>
Approved by: bde
MFC after: 2 weeks


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 93384 29-Mar-2002 tanimura

The description of fd_mtx is "filedesc structure."


# 92751 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 92723 19-Mar-2002 alfred

Remove __P.


# 92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 92641 19-Mar-2002 alfred

Close a race when vfs_syscalls.c:checkdirs() runs.

To do this protect the filedesc pointer in the proc with PROC_LOCK
in both checkdirs() and kern_descrip.c:fdfree().


# 92310 15-Mar-2002 alfred

Giant pushdown for read/write/pread/pwrite syscalls.

kern/kern_descrip.c:
Aquire Giant in fdrop_locked when file refcount hits zero, this removes
the requirement for the caller to own Giant for the most part.

kern/kern_ktrace.c:
Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.

kern/vfs_bio.c:
Aquire Giant in bwillwrite if needed.

kern/sys_generic.c
Giant pushdown, remove Giant for:
read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.

kern/sys_socket.c
Grab giant in the socket fo_read/write functions.

kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.


# 91406 27-Feb-2002 jhb

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 91140 23-Feb-2002 tanimura

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 90392 08-Feb-2002 peter

Fix broken Giant locking protocol introduced in rev 1.114. You cannot
unlock Giant if it is not locked in the first place. This make the
nfstat(2) syscall (#278) a nice panic(2) implementation.


# 90089 01-Feb-2002 alfred

Remove bogus assertion in dup2 that can lead to panics when kernel
threads race for a file slot.

dup2(2) incorrectly assumes that if it needs to grow the ofiles
array that it will get what it wants. This assertion was valid
before we allowed shared filedescriptor tables but is now incorrect.

The assertion can trigger superfolous panics if the thread doing a
dup2 looses a race with another thread while possibly blocked in
the MALLOC call in fdalloc. Another thread may grab the slot we
are requesting which makes fdalloc return something other than what
we asked for, this will triggering the bogus assertion.

MFC after: 2 weeks
Reviewed by: phk


# 90088 01-Feb-2002 alfred

Avoid lock order reversal filedesc/Giant when calling FREE() in fdalloc
by unlocking the filedesc before calling FREE().

Submitted by: bde


# 89969 29-Jan-2002 alfred

Attempt to fixup select(2) and poll(2), this should fix some races with
other threads as well as speed up the interfaces.

To fix the race and accomplish the speedup, remove selholddrop and
pollholddrop. The entire concept is somewhat bogus because holding
the individual struct file pointers offers us no guarantees that
another thread context won't close it on us thereby removing our
access to our own reference.

Selholddrop and pollholddrop also would do multiple locks and unlocks
of mutexes _per-file_ in the fd arrays to be scanned, this needed to
be sped up.

Instead of using selholddrop and pollholddrop, simply hold the
filedesc lock over the selscan and pollscan functions. This should
protect us against close(2)'s on the files as reduce the multiple
lock/unlock pairs per fd into a single lock over the filedesc.


# 89961 29-Jan-2002 alfred

Backout 1.120, EINVAL isn't a proper error return when the passed fd is
negative, the 'pointer' referred to by the manpage is actually the
struct file's f_offset field.

Pointed out by: bde


# 89698 23-Jan-2002 alfred

in fget() return EINVAL when the descriptor requested is negative.


# 89595 20-Jan-2002 alfred

use mutex pools for "struct file" locking.
fix indentation of FILE_LOCK/UNLOCK macros while I'm here.


# 89377 14-Jan-2002 alfred

Push down Giant in dup(2) and dup2(2), Giant is only needed when
calling closef() in the case of dup2(2) duping over a descriptor
and when fdalloc must grow or free a filedesc.


# 89319 13-Jan-2002 alfred

Replace ffind_* with fget calls.

Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().


# 89310 13-Jan-2002 alfred

Comment fdrop and fdrop_locked functions.


# 89309 13-Jan-2002 alfred

Implement ffind_hold using ffind_lock.

Recommended by: jhb


# 89306 13-Jan-2002 alfred

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# 87905 14-Dec-2001 jlemon

When removing kqueue descriptors from the descriptor table during a fork,
update fd_freefile and fd_lastfile as well, to keep things in sync.

Pointed out by: Debbie Chu <dchu@juniper.net>


# 86487 17-Nov-2001 dillon

Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.


# 86341 14-Nov-2001 dillon

remove holdfp()

Replace uses of holdfp() with fget*() or fgetvp*() calls as appropriate

introduce fget(), fget_read(), fget_write() - these functions will take
a thread and file descriptor and return a file pointer with its ref
count bumped.

introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will
take a thread and file descriptor and return a vref()'d vnode.

*_read() requires that the file pointer be FREAD, *_write that it be
FWRITE.

This continues the cleanup of struct filedesc and struct file access
routines which, when are all through with it, will allow us to then
make the API calls MP safe and be able to move Giant down into the fo_*
functions.


# 84827 11-Oct-2001 jhb

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# 84169 30-Sep-2001 jlemon

When FREE()ing kqueue related structures, charge them to the correct bucket.

Submitted by: iedowse
Forgotten by: jlemon


# 83374 12-Sep-2001 julian

If an incoming struct proc could have been NULL before, tehn don't
automatically change the code to add

struct proc *p = td->td_proc;

because now 'td' is probably capable of being NULL too.
I expect to see more of this kind of error during the 'weeding'
process. It's too easy to make. (junior hacker project.. look for these :-)

Submitted by: mark Peek <mp@freebsd.org>


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 82749 01-Sep-2001 dillon

Giant Pushdown. Saved the worst P4 tree breakage for last.

reboot() getpriority() setpriority() rtprio() osetrlimit() ogetrlimit()
setrlimit() getrlimit() getrusage() getpid() getppid() getpgrp()
getpgid() getsid() getgid() getegid() getgroups() setsid() setpgid()
setuid() seteuid() setgid() setegid() setgroups() setreuid() setregid()
setresuid() setresgid() getresuid() getresgid () __setugid() getlogin()
setlogin() modnext() modfnext() modstat() modfind() kldload() kldunload()
kldfind() kldnext() kldstat() kldfirstmod() kldsym() getdtablesize()
dup2() dup() fcntl() close() ofstat() fstat() nfsstat() fpathconf()
flock()


# 82516 29-Aug-2001 ache

advlock: simplify overflow checks


# 82188 23-Aug-2001 ache

Move <machine/*> after <sys/*>
Add missing fdrop() before EOVERFLOW

Pointed by: bde


# 82172 23-Aug-2001 ache

Detect off_t EOVERFLOW of start/end offsets calculations for adv. lock,
as POSIX require.


# 81193 06-Aug-2001 chris

Remove the fildesc_clone() function and its associated unnecessary code.
It didn't implement the proper /dev/fd functionality (which would be to
include in the directory listing /dev/fd/n if the process has fd n open)
anyway.

Anything needing access to /dev/fd/n where n > 2 can use the optional
fdescfs module, which implements this properly and does not cause any
trouble with devfs.

Discussed with: phk


# 77183 25-May-2001 rwatson

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


# 76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 75893 23-Apr-2001 jhb

Change the pfind() and zpfind() functions to lock the process that they
find before releasing the allproc lock and returning.

Reviewed by: -smp, dfr, jake


# 74810 26-Mar-2001 phk

Send the remains (such as I have located) of "block major numbers" to
the bit-bucket.


# 74523 20-Mar-2001 phk

Make the pseudo-driver for "/dev/fd/*" handle fd's larger than 255.

PR: 25936


# 72521 15-Feb-2001 jlemon

Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 68883 18-Nov-2000 dillon

This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>


# 67808 28-Oct-2000 alc

Add missing call to knote_fdclose() in setugidsafety() and fdcloseexec().

Reviewed by: jlemon


# 65374 02-Sep-2000 phk

Avoid the modules madness I inadvertently introduced by making the
cloning infrastructure standard in kern_conf. Modules are now
the same with or without devfs support.

If you need to detect if devfs is present, in modules or elsewhere,
check the integer variable "devfs_present".

This happily removes an ugly hack from kern/vfs_conf.c.

This forces a rename of the eventhandler and the standard clone
helper function.

Include <sys/eventhandler.h> in <sys/conf.h>: it's a helper #include
like <sys/queue.h>

Remove all #includes of opt_devfs.h they no longer matter.


# 65122 26-Aug-2000 alfred

new sysctl 'kern.openfiles' (exports nfiles to userland)


# 65052 24-Aug-2000 phk

Dang, a _clone routine escaped #ifdef DEVFS containment.


# 65051 24-Aug-2000 phk

Fix panic when removing open device (found by bp@)
Implement subdirs.
Build the full "devicename" for cloning functions.
Fix panic when deleted device goes away.
Collaps devfs_dir and devfs_dirent structures.
Add proper cloning to the /dev/fd* "device-"driver.
Fix a bug in make_dev_alias() handling which made aliases appear
multiple times.
Use devfs_clone to implement getdiskbyname()
Make specfs maintain the stat(2) timestamps per dev_t


# 64529 11-Aug-2000 peter

Clean up some low level bootstrap code:

- stop using the evil 'struct trapframe' argument for mi_startup()
(formerly main()). There are much better ways of doing it.
- do not use prepare_usermode() - setregs() in execve() will do it
all for us as long as the p_md.md_regs pointer is set. (which is
now done in machdep.c rather than init_main.c. The Alpha port did it
this way all along and is much cleaner).
- collect all the magic %cr0 etc register settings into one place and
have the AP's call that instead of using magic numbers (!!) that keep
changing over and over again.
- Make it safe to call kthread_create() earlier, including during the
device probe sequence. It doesn't need the callback mechanism that
NetBSD's version uses.
- kthreads created this way are root-less as they exist before the root
filesystem is mounted. init(1) is set up so that it aquires the root
pointers prior to running. If other kthreads want filesystem acccess
we can make this code more generic.
- set all threads start times once we have decided what time it is.
- init uses a trampoline rather than the evil prepare_usermode() hack.
- kern_descrip.c has a couple of tweaks to deal with forking when there
is no rootdir or cwd etc.
- adjust the early SYSINIT() sequence so that a few prereqisites are in
place. eg: make sure the run queue is initialized before doing forks.

With this, the USB code can easily create a kthread to do the device
tree discovery. (I have tested it, it works nicely).

There are still some open issues before this is truely useful.
- tsleep() does not like working before the clock is running. It
sort-of tries to spin wait, but it can do more useful things now.
- stopping a kthread in kld code at unload time is "interesting" but
we have a solution for that.

The Alpha code needs no changes for this. It already uses pretty much the
same strategies, but a little cleaner.


# 62573 04-Jul-2000 phk

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 62454 03-Jul-2000 phk

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 62185 27-Jun-2000 alfred

don't panic the system when fpathconv is called on an unsupported filetype.


# 60938 26-May-2000 jake

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 60833 23-May-2000 jake

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 59288 16-Apr-2000 jlemon

Introduce kqueue() and kevent(), a kernel event notification facility.


# 56363 21-Jan-2000 imp

Fix the style bugs in the style bugs fix. The style bug fix made the
new function inconsistant with the rest of this file. The spelling
and grammer fixes were good and remain.


# 56359 21-Jan-2000 green

Fix style bugs in the last commit.


# 56339 20-Jan-2000 imp

bdeize last commit:
o Remove opt_dontuse.h and ifdef PROCFS

Subitted by: bde, peter


# 56313 20-Jan-2000 imp

When we are execing a setugid program, and we have a procfs filesystem
file open in one of the special file descriptors (0, 1, or 2), close
it before completing the exec.

Submitted by: nergal@idea.avet.com.pl
Constructive comments: deraadt@openbsd.org, sef, peter, jkh


# 55113 26-Dec-1999 bde

Removed unused includes.

Rumoved unused compatibility cruft for dup(). Using it today would just
break dup() on fd's >= 64.

Fixed some style bugs.


# 53347 18-Nov-1999 dillon

Only bother converting the stat structure if we intend to return it,
when no error occurs.

PR: kern/14966
Reviewed by: dillon@freebsd.org
Submitted by: Kelly Yancey kbyanc@posi.net


# 53333 18-Nov-1999 peter

Remove cdevsw_add() - the necessary make_dev() calls appear to be there
already.


# 53212 16-Nov-1999 phk

This is a partial commit of the patch from PR 14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 52982 08-Nov-1999 peter

Use fo_stat() rather than duplicating knowledge of file type internals
in here for stat(2) and friends. Update the badops entries accordingly.


# 52955 07-Nov-1999 green

Fix the advisory file locking by restoring previous ordering in closef()/
fdrop(). This only showed up when a file descriptor was duplicated
and then closed once, where the lock would be released on the first close().


# 52128 11-Oct-1999 peter

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# 51658 25-Sep-1999 phk

Remove five now unused fields from struct cdevsw. They should never
have been there in the first place. A GENERIC kernel shrinks almost 1k.

Add a slightly different safetybelt under nostop for tty drivers.

Add some missing FreeBSD tags


# 51649 25-Sep-1999 phk

Fix a hole in jail(2).

Noticed by: Alexander Bezroutchko <abb@zenon.net>


# 51418 19-Sep-1999 green

This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 50254 23-Aug-1999 phk

Convert DEVFS hooks in (most) drivers to make_dev().

Diskslice/label code not yet handled.

Vinum, i4b, alpha, pc98 not dealt with (left to respective Maintainers)

Add the correct hook for devfs to kern_conf.c

The net result of this excercise is that a lot less files depends on DEVFS,
and devtoname() gets more sensible output in many cases.

A few drivers had minor additional cleanups performed relating to cdevsw
registration.

A few drivers don't register a cdevsw{} anymore, but only use make_dev().


# 49413 04-Aug-1999 green

Fix fd race conditions (during shared fd table usage.) Badfileops is
now used in f_ops in place of NULL, and modifications to the files
are more carefully ordered. f_ops should also be set to &badfileops
upon "close" of a file.

This does not fix other problems mentioned in this PR than the first
one.

PR: 11629
Reviewed by: peter


# 47829 07-Jun-1999 msmith

From the submitter:

- this causes POSIX locking to use the thread group leader
(p->p_leader) as the locking thread for all advisory locks.
In non-kernel-threaded code p->p_leader == p, so this will have
no effect.

This results in (more) correct POSIX threaded flock-ing semantics.

It also prevents the leader from exiting before any of the children.
(so that p->p_leader will never be stale) in exit1().

We have been running this patch for over a month now in our lab
under load and at customer sites.

Submitted by: John Plevyak <jplevyak@inktomi.com>


# 47640 31-May-1999 phk

Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print an message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.


# 47625 30-May-1999 phk

This commit should be a extensive NO-OP:

Reformat and initialize correctly all "struct cdevsw".

Initialize the d_maj and d_bmaj fields.

The d_reset field was not removed, although it is never used.

I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.

Vinum and i4b not modified, patches emailed to respective authors.


# 47028 11-May-1999 phk

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


# 46381 03-May-1999 billf

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 46153 28-Apr-1999 dt

s/static foo_devsw_installed = 0;/static int foo_devsw_installed;/.
(Edited automatically)


# 42408 08-Jan-1999 eivind

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# 41087 11-Nov-1998 truckman

I got another batch of suggestions for cosmetic changes from bde.


# 41086 11-Nov-1998 truckman

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind


# 37951 29-Jul-1998 bde

Fixed printf format errors.


# 37659 15-Jul-1998 bde

Cast longs to intptr_t before casting them to pointers.

Fixed bitrot in pseudo-declaration of `struct fcntl_args'. fcntl()
is now broken in some cases when ints are larger than longs.


# 36844 10-Jun-1998 dfr

64bit fixes: p->p_retval is a register_t[] not an int[].


# 35938 11-May-1998 dyson

Fix the futimes/undelete/utrace conflict with other BSD's. Note that
the only common usage of utrace (the possible problem with this
commit) is with malloc, so this should be a real problem. Add
the various NetBSD syscalls that allow full emulation of their
development environment.


# 33360 15-Feb-1998 dyson

Make the rootdir handling more consistent. Now, processes always
have a root vnode associated with them, and no special checks for
the null case are needed.
Submitted by: terry@freebsd.org


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32726 24-Jan-1998 eivind

Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style.

This introduce an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)

LFS is temporarily disabled, and will be re-enabled tomorrow.


# 31778 16-Dec-1997 eivind

Make COMPAT_43 and COMPAT_SUNOS new-style options.


# 31443 28-Nov-1997 dyson

Fix and complete the AIO syscalls. There are some performance enhancements
coming up soon, but the code is functional. Docs will be forthcoming.


# 31368 23-Nov-1997 bde

Fixed a missing conversion of retval to p_retval in disabled code.

Fixed overflow of FFLAGS() in fcntl(F_SETFL, ...). This was not
a security hole, but gave wrong results for silly flags values.
E.g., it make fcntl(F_SETFL, -1) equivalent to fcntl(F_SETFL, 0).
POSIX requires ignoring the open mode bits in fcntl() (even if
they would be invalid for open()).


# 31365 23-Nov-1997 bde

Fixed duplicate definitions of M_FILE (one static).


# 30994 06-Nov-1997 phk

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# 30354 12-Oct-1997 phk

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 30309 11-Oct-1997 phk

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 29361 14-Sep-1997 peter

Various select -> poll changes


# 28766 25-Aug-1997 bde

Removed some stale comments.

Fixed a gratuitous ANSIism.


# 24752 09-Apr-1997 bde

Removed support for OLD_PIPE. <sys/stat.h> is now missing the hack that
supported nameless pipes being indistinguishable from fifos. We're not
going back.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 21002 29-Dec-1996 dyson

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!


# 20692 19-Dec-1996 bde

Fixed nonexistent checking of lock types for F_GETLK.

Found by: NIST-PCTS


# 20691 19-Dec-1996 bde

Fixed lseek() on named pipes. It always succeeded but should always fail.
Broke locking on named pipes in the same way as locking on non-vnodes
(wrong errno). This will be fixed later.

The fix involves negative logic. Named pipes are now distinguished from
other types of files with vnodes, and there is additional code to handle
vnodes and named pipes in the same way only where that makes sense (not
for lseek, locking or TIOCSCTTY).


# 18541 28-Sep-1996 bde

Fixed bitrot in the read-only attribute:
- kern.maxfilesperproc was read-only (and thus essentially useless).

Removed unused #includes. Strength-reduced used #includes.


# 17609 15-Aug-1996 smpatel

Fix fdavail() so that correctly pays attention to the rlimit.

Fixes unp_externalize panic which occurs when a process is at it's
ulimit for file descriptors and tries to receive a file descriptor from
another process.

Reviewed by: wollman


# 16440 17-Jun-1996 wpaul

Add a couple of #ifdef DEVFS/#endif clauses to slence the following
compiler warnings which occur if you don't have 'options DEVFS' in
your kernel config file:

../../kern/kern_descrip.c: In function `fildesc_drvinit':
../../kern/kern_descrip.c:1103: warning: unused variable `fd'
../../kern/kern_descrip.c: At top level:
../../kern/kern_descrip.c:1095: warning: `devfs_token_stdin' defined but not use
d
../../kern/kern_descrip.c:1096: warning: `devfs_token_stdout' defined but not us
ed
../../kern/kern_descrip.c:1097: warning: `devfs_token_stderr' defined but not us
ed
../../kern/kern_descrip.c:1098: warning: `devfs_token_fildesc' defined but not u
sed


# 16322 12-Jun-1996 gpalmer

Clean up -Wunused warnings.

Reviewed by: bde


# 14849 27-Mar-1996 bde

Fixed the unit numbers of the devfs `fd' devices.

Made the devfs `fd' devices bug for bug compatible with the ones created
by MAKEDEV:
- ownership is bin.bin, not root.wheel, except for std*. The devfsext
interface doesn't seem to allow specifying the ownership of /devfs/fd,
so it's still incompatible.
- std* aren't links to fd/[0-2].


# 14497 11-Mar-1996 hsu

Merge in Lite2: LIST replacement for f_filef, f_fileb, and filehead.
Did not accept change of second argument to ioctl from int to u_long.
Reviewed by: davidg & bde


# 14221 23-Feb-1996 peter

kern_descrip.c: add fdshare()/fdcopy()
kern_fork.c: add the tiny bit of code for rfork operation.
kern/sysv_*: shmfork() takes one less arg, it was never used.
sys/shm.h: drop "isvfork" arg from shmfork() prototype
sys/param.h: declare rfork args.. (this is where OpenBSD put it..)
sys/filedesc.h: protos for fdshare/fdcopy.
vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where
it makes sense.
vm/*: drop unused isvfork arg.

Note: this rfork() implementation copies the address space mappings,
it does not connect the mappings together. ie: once the two processes
have split, the pages may be shared, but the address space is not. If one
does a mmap() etc, it does not appear in the other. This makes it not
useful for pthreads, but it is useful in it's own right for having
light-weight threads in a static shared address space.

Obtained from: Original by Ron Minnich, extended by OpenBSD


# 13907 04-Feb-1996 dyson

Improve the performance for pipe(2) again. Also include some
fixes for previous version of new pipes from Bruce Evans. This
new version:

Supports more properly the semantics of select (BDE).
Supports "OLD_PIPE" correctly (kern_descrip.c, BDE).
Eliminates incorrect EPIPE returns (bash 'pipe broken' messages.)
Much faster yet, currently tuned relatively conservatively -- but now
gives approx 50% more perf than the new pipes code did originally.
(That was about 50% more perf than the original BSD pipe code.)

Known bugs outstanding:
No support for async io (SIGIO). Will be included soon.

Next to do:
Merge support for FIFOs.

Submitted by: bde


# 13676 28-Jan-1996 dyson

Enable the new fast pipe code. The old pipes can be used with the
"OLD_PIPE" config option.


# 12819 14-Dec-1995 phk

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# 12678 08-Dec-1995 phk

Julian forgot to make the *devsw structures static.


# 12675 08-Dec-1995 julian

Pass 3 of the great devsw changes
most devsw referenced functions are now static, as they are
in the same file as their devsw structure. I've also added DEVFS
support for nearly every device in the system, however
many of the devices have 'incorrect' names under DEVFS
because I couldn't quickly work out the correct naming conventions.
(but devfs won't be coming on line for a month or so anyhow so that doesn't
matter)

If you "OWN" a device which would normally have an entry in /dev
then search for the devfs_add_devsw() entries and munge to make them right..
check out similar devices to see what I might have done in them in you
can't see what's going on..
for a laugh compare conf.c conf.h defore and after... :)
I have not doen DEVFS entries for any DISKSLICE devices yet as that will be
a much more complicated job.. (pass 5 :)

pass 4 will be to make the devsw tables of type (cdevsw * )
rather than (cdevsw)
seems to work here..
complaints to the usual places.. :)


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12645 05-Dec-1995 bde

Include <vm/vm.h> or <vm/vm_page.h> explicitly to avoid breaking when
vnode_if.h doesn't include vm stuff.


# 12623 04-Dec-1995 phk

A major sweep over the sysctl stuff.

Move a lot of variables home to their own code (In good time before xmas :-)

Introduce the string descrition of format.

Add a couple more functions to poke into these marvels, while I try to
decide what the correct interface should look like.

Next is adding vars on the fly, and sysctl looking at them too.

Removed a tine bit of defunct and #ifdefed notused code in swapgeneric.


# 12577 02-Dec-1995 bde

Completed function declarations and/or added prototypes.


# 12521 29-Nov-1995 julian

If you're going to mechanically replicate something in 50 files
it's best to not have a (compiles cleanly) typo in it! (sigh)


# 12517 29-Nov-1995 julian

OK, that's it..
That's EVERY SINGLE driver that has an entry in conf.c..
my next trick will be to define cdevsw[] and bdevsw[]
as empty arrays and remove all those DAMNED defines as well..

Each of these drivers has a SYSINIT linker set entry
that comes in very early.. and asks teh driver to add it's own
entry to the two devsw[] tables.

some slight reworking of the commits from yesterday (added the SYSINIT
stuff and some usually wrong but token DEVFS entries to all these
devices.

BTW does anyone know where the 'ata' entries in conf.c actually reside?
seems we don't actually have a 'ataopen() etc...

If you want to add a new device in conf.c
please make sure I know
so I can keep it up to date too..

as before, this is all dependent on #if defined(JREMOD)
(and #ifdef DEVFS in parts)


# 12277 14-Nov-1995 phk

Add new-style sysctl for KERN_FILE here.


# 12221 12-Nov-1995 bde

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


# 11607 21-Oct-1995 dg

Killed a few gratuitous #include's.


# 11332 07-Oct-1995 swallace

Remove prototype definitions from <sys/systm.h>.
Prototypes are located in <sys/sysproto.h>.

Add appropriate #include <sys/sysproto.h> to files that needed
protos from systm.h.

Add structure definitions to appropriate files that relied on sys/systm.h,
right before system call definition, as in the rest of the kernel source.

In kern_prot.c, instead of using the dummy structure "args", create
individual dummy structures named <syscall>_args. This makes
life easier for prototype generation.


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 7430 28-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) that I didn't notice when I fixed
"all" such warnings before.


# 6577 20-Feb-1995 guido

Implement maxprocperuid and maxfilesperproc. They are tunable
via sysctl(8). The initial value of maxprocperuid is maxproc-1,
that of maxfilesperproc is maxfiles (untill maxfile will disappear)

Now it is at least possible to prohibit one user opening maxfiles

-Guido

Submitted by:
Obtained from:


# 5082 12-Dec-1994 bde

Obtained from: my fix for 1.1.5

Remove compatibility hack so that dup(fd) isn't interpreted as
dup2(fd & 0x3f, random_junk_on_stack_fd) when (fd & 0x3f) != 0.


# 3308 02-Oct-1994 phk

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 3098 25-Sep-1994 phk

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 2458 02-Sep-1994 dg

munmapfd() was being called with one too few params - bug introduced
during my initial kernel port.


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources