History log of /freebsd-current/sys/sys/syscallsubr.h
Revision Date Author Comments
# 9a9677ec 19-May-2024 Ricardo Branco <rbranco@suse.de>

linux: Update linux manpage to mention mqueuefs

Reviewed by: imp, kib
Pull Request: https://github.com/freebsd/freebsd-src/pull/1248


# e30621d5 18-May-2024 Ricardo Branco <rbranco@suse.de>

mqueue: Introduce kern_kmq_timedreceive & kern_kmq_timedsend

Reviewed by: imp, kib
Pull Request: https://github.com/freebsd/freebsd-src/pull/1248


# f5449510 19-Mar-2024 Konstantin Belousov <kib@FreeBSD.org>

sys/syscallsubr.h: align definition of kern_fcntl_freebsd() on 32bit

Fixes: d0efabdf15d956e9bc0414356ed798ca3c846e08


# 18cb4223 31-Jan-2024 John Baldwin <jhb@FreeBSD.org>

timerfd: Move kern_timerfd_* prototypes to <sys/syscallsubr.h>


# 14505c92 30-Jan-2024 John Baldwin <jhb@FreeBSD.org>

syscallsubr.h: Sort kern_membarrier prototype alphabetically


# c662306e 20-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add kern_openatfp(9)

Reviewed by: markj, pjd
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43529


# 2a284076 20-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

kern_openat(): rename fd argument to dirfd

Reviewed by: markj, pjd
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43529


# d8decc9a 19-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add kcmp(2) kernel bits

This is based purely on reading the Linux kcmp(2) man page.
In addition to the Linux set of comparators, I also added KCMP_FILEOBJ to
compare underlying file' objects.

Tested by: manu
Reviewed by: brooks, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43518


# 0fac350c 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on getpeername/getsockname

Just like it was done for accept(2) in cfb1e92912b4, use same approach
for two simplier syscalls that return socket addresses. Although,
these two syscalls aren't performance critical, this change generalizes
some code between 3 syscalls trimming code size.

Following example of accept(2), provide VNET-aware and INVARIANT-checking
wrappers sopeeraddr() and sosockaddr() around protosw methods.

Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D42694


# cfb1e929 30-Nov-2023 Gleb Smirnoff <glebius@FreeBSD.org>

sockets: don't malloc/free sockaddr memory on accept(2)

Let the accept functions provide stack memory for protocols to fill it in.
Generic code should provide sockaddr_storage, specialized code may provide
smaller structure.

While rewriting accept(2) make 'addrlen' a true in/out parameter, reporting
required length in case if provided length was insufficient. Our manual
page accept(2) and POSIX don't explicitly require that, but one can read
the text as they do. Linux also does that. Update tests accordingly.

Reviewed by: rscheff, tuexen, zlei, dchagin
Differential Revision: https://reviews.freebsd.org/D42635


# 3555be01 08-Sep-2023 John Baldwin <jhb@FreeBSD.org>

Move kern_extattr_* prototypes to <sys/syscallsubr.h>

All of the kern_* prototypes belong in this header. While here, sort
the prototypes by function name.

Reviewed by: dchagin
Fixes: 6453d4240f6b vfs: Export exattr methods to reuse by Linuxulator
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D41766


# 4a69fc16 07-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Add membarrier(2)

This is an attempt at clean-room implementation of the Linux'
membarrier(2) syscall. For documentation, you would need to read
both membarrier(2) Linux man page, the comments in Linux
kernel/sched/membarrier.c implementation and possibly look at
actual uses.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32360


# 95ee2897 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: two-line .h pattern

Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/


# 9b65fa69 29-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

linuxolator: implement Linux' PROT_GROWSDOWN

From the Linux man page for mprotect(2):
PROT_GROWSDOWN
Apply the protection mode down to the beginning of a mapping
that grows downward (which should be a stack segment or a
segment mapped with the MAP_GROWSDOWN flag set).

Reported by: dchagin
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41099


# 07c0b6e5 29-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

vfs: Retire kern_alternate_path() as unused anymore

From now a non-native ABI should use pwd_altroot() ability to tell
to the namei() its root directory to dynamically reroots lookups.

Differential Revision: https://reviews.freebsd.org/D40093
MFC after: 2 month


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# cb858340 28-Apr-2023 Dmitry Chagin <dchagin@FreeBSD.org>

linux(4): Add a dedicated statat() implementation

Get rid of calling Linux stat translation hook and specific to Linux
handling of non-vnode dirfd from kern_statat(),

Reviewed by: kib, mjg
Differential revision: https://reviews.freebsd.org/D35474


# d46174cd 28-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Finish cpuset_getaffinity() after f35093f8

Split cpuset_getaffinity() into a two counterparts, where the
user_cpuset_getaffinity() is intended to operate on the cpuset_t from
user va, while kern_cpuset_getaffinity() expects the cpuset from kernel
va.
Accordingly, the code that clears the high bits is moved to the
user_cpuset_getaffinity(). Linux sched_getaffinity() syscall returns
the size of set copied to the user-space and then glibc wrapper clears
the high bits.

MFC after: 2 weeks


# 47a57144 12-May-2022 Justin Hibbits <jhibbits@FreeBSD.org>

cpuset: Byte swap cpuset for compat32 on big endian architectures

Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits. On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode. To
demonstrate:

32-bit mode:

BIT_SET(foo, 0): 0x00000001

64-bit sees: 0x0000000100000000

cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.

Reviewed by: kib
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D35225


# f35093f8 11-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Use Linux semantics for the thread affinity syscalls.

Linux has more tolerant checks of the user supplied cpuset_t's.

Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.

Reviewed by: Pau Amma (man pages)
In collaboration with: jhb
Differential revision: https://reviews.freebsd.org/D34849
MFC after: 2 weeks


# f04534f5 06-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

sysvsem: Add a timeout argument to the semop.

For future use in the Linux emulation layer for the semtimedop syscall
split the sys_semop syscall into two counterparts and add
struct timespec *timeout argument to the last one.

Reviewed by: jhb, kib
Differential revision: https://reviews.freebsd.org/D35121
MFC after: 2 weeks


# f3f3e3c4 03-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

fd: add close_range(..., CLOSE_RANGE_CLOEXEC)

For compatibility with Linux.

MFC after: 3 days
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D34424


# c40fee6f 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the always curthread argument from kern_alternate_path


# 2b9d052d 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: fix getfsstat sign extension bugs

Add freebsd32 versions of getfsstat and freebsd11_getfsstat so that
bufsize is properly sign-extended if a negative value is passed.
Reject negative values before passing to kern_getfsstat as a size_t.

Reviewed by: kevans


# e02f64d9 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

freebsd32: add real abort2

Previously, the code would copy twice as many pointers as specified
and print pairs of them a single 64-bit pointer.

abort2 doesn't return so make the return type void

freebsd32_abort2 is in it's own file with a 2-clause BSD license
based on a discussion with Wojciech many years ago.

Reviewed by: kevans


# b7fd8611 17-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: sprinkle in const values

Add missing const qualifiers to a number of syscall arguments.

Obtained from: CheriBSD

Reviewed by: kevans


# 01ce7fca 15-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

ommap: fix signed len and pos arguments

4.3 BSD's mmap took an int len and long pos. Reject negative lengths
and in freebsd32 sign-extend pos correctly rather than mis-handling
negative positions as large positive ones.

Reviewed by: kib


# 3225fd22 04-Nov-2021 John Baldwin <jhb@FreeBSD.org>

kern_utimensat: Update name of last arg in prototype.

The last argument is a mask of AT_* flags, not a namei cnp flag as
'int follow' implies in other kern_* functions.

Obtained from: CheriBSD
Sponsored by: The University of Cambridge, Google Inc.


# 0dc332bf 05-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347


# e884512a 10-Jun-2021 Dmitry Chagin <dchagin@FreeBSD.org>

Split kern_poll() on two counterparts.

The kern_poll_kfds() operates on clear kernel data, kfds points to an
array in the kernel, while kern_poll() operates on user supplied pollfd.
Move nfds check to kern_poll_maxfds().

No functional changes, it's for future use in the Linux emulation layer.

Reviewd by: kib
Differential Revision: https://reviews.freebsd.org/D30690
MFC after: 2 weeks


# 5d1d844a 25-Apr-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW

This makes this API match other kern_xxxat() functions.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D29776


# 7a1591c1 22-Jan-2021 Brooks Davis <brooks@FreeBSD.org>

Rename kern_mmap_req to kern_mmap

Replace all uses of kern_mmap with kern_mmap_req move the old kern_mmap.
Reand rename kern_mmap_req to kern_mmap .

The helper saved some code churn initially, but having multiple
interfaces is sub-optimal.

Obtained from: CheriBSD
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28292


# 7a202823 23-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Expose eventfd in the native API/ABI using a new __specialfd syscall

eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues
are not passable between processes. And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow. This came up when debugging strange HW video decode
failures in Firefox. A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.

Submitted by: greg@unrelenting.technology
Reviewed by: markj (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26668


# be2535b0 04-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Add kern_ntp_adjtime(9).

Reviewed by: brooks, cy
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27471


# 15eaec6a 21-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

_umtx_op: move compat32 definitions back in

These are reasonably compact, and a future commit will blur the compat32
lines by supporting 32-bit operations with the native _umtx_op.


# de774e42 17-Nov-2020 Conrad Meyer <cem@FreeBSD.org>

linux(4): Implement name_to_handle_at(), open_by_handle_at()

They are similar to our getfhat(2) and fhopen(2) syscalls.

Differential Revision: https://reviews.freebsd.org/D27111


# 63ecb272 16-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

umtx_op: reduce redundancy required for compat32

All of the compat32 variants are substantially the same, save for
copyin/copyout (mostly). Apply the same kind of technique used with kevent
here by having the syscall routines supply a umtx_copyops describing the
operations needed.

umtx_copyops carries the bare minimum needed- size of timespec and
_umtx_time are used for determining if copyout is needed in the sem2_wait
case.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27222


# aaf78c16 23-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not leak oldvmspace if image activation failed

and current address space is already destroyed, so kern_execve()
terminates the process.

While there, clean up some internals of post_execve() inlined in init_main.

Reported by: Peter <pmc@citylink.dinoex.sub.org>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26525


# 67a659d2 08-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add kern_mmap_racct_check(), a helper to verify limits in vm_mmap*().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652


# 52c81be1 20-Jun-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add linux_madvise(2) instead of having Linux apps call the native
FreeBSD madvise(2) directly. While some of the flag values match,
most don't.

PR: kern/230160
Reported by: markj
Reviewed by: markj
Discussed with: brooks, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25272


# 472ced39 12-Apr-2020 Kyle Evans <kevans@FreeBSD.org>

Implement a close_range(2) syscall

close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.

sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.

The flags argument of close_range(2) is currently unused, so any flags set
is currently EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.

This patch is based on a syscall of the same design that is expected to be
merged into Linux.

Reviewed by: kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision: https://reviews.freebsd.org/D21627


# d718de81 04-Mar-2020 Brooks Davis <brooks@FreeBSD.org>

Introduce kern_mmap_req().

This presents an extensible interface to the generic mmap(2)
implementation via a struct pointer intended to use a designated
initializer or compount literal. We take advantage of the mandatory
zeroing of fields not listed in the initializer.

Remove kern_mmap_fpcheck() and use kern_mmap_req().

The motivation for this change is a desire to keep the core
implementation from growing an ever-increasing number of arguments
that must be specified in the correct order for the lowest-level
implementations. In CheriBSD we have already added two more arguments.

Reviewed by: kib
Discussed with: kevans
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D23164


# cbc10891 03-Feb-2020 Dmitry Chagin <dchagin@FreeBSD.org>

For code reuse in Linuxulator rename get_proccess_cputime()
and get_thread_cputime() and add prototypes for it to <sys/syscallsubr.h>.

As both functions become a public interface add process lock assert
to ensure that the process is not exiting under it.

Fix whitespace nit while here.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23340
MFC after 2 weeks


# 7739d927 01-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: replace kern___getcwd with vn_getcwd

The previous routine was resulting in extra data copies most notably in
linux_getcwd.


# b3fb13eb 24-Jan-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_unmount() and use in Linuxulator. No functional changes.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22646


# ca603bb1 12-Jan-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

dd kern_getpriority(), make Linuxulator use it.

Reviewed by: kib, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22842


# 7a0ef283 12-Jan-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_setpriority(), use it in Linuxulator.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22841


# 535b1df9 04-Jan-2020 Kyle Evans <kevans@FreeBSD.org>

shm: correct KPI mistake introduced around memfd_create

When file sealing and shm_open2 were introduced, we should have grown a new
kern_shm_open2 helper that did the brunt of the work with the new interface
while kern_shm_open remains the same. Instead, more complexity was
introduced to kern_shm_open to handle the additional features and consumers
had to keep changing in somewhat awkward ways, and a kern_shm_open2 was
added to wrap kern_shm_open.

Backpedal on this and correct the situation- kern_shm_open returns to the
interface it had prior to file sealing being introduced, and neither
function needs an initial_seals argument anymore as it's handled in
kern_shm_open2 based on the shmflags.


# 18348a23 04-Jan-2020 Kyle Evans <kevans@FreeBSD.org>

kern_mmap: add a variant that allows caller to inspect fp

Linux mmap rejects mmap() on a write-only file with EACCES.
linux_mmap_common currently does a fun dance to grab the fp associated with
the passed in fd, validates it, then drops the reference and calls into
kern_mmap(). Doing so is perhaps both fragile and premature; there's still
plenty of chance for the request to get rejected with a more appropriate
error, and it's prone to a race where the file we ultimately mmap has
changed after it drops its referenced.

This change alleviates the need to do this by providing a kern_mmap variant
that allows the caller to inspect the fp just before calling into the fileop
layer. The callback takes flags, prot, and maxprot as one could imagine
scenarios where any of these, in conjunction with the file itself, may
influence a caller's decision.

The file type check in the linux compat layer has been removed; EINVAL is
seemingly not an appropriate response to the file not being a vnode or
device. The fileop layer will reject the operation with ENODEV if it's not
supported, which more closely matches the common linux description of
mmap(2) return values.

If we discover that we're allowing an mmap() on a file type that Linux
normally wouldn't, we should restrict those explicitly.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22977


# 34ad5ac2 13-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_kill() and use it in Linuxulator. It's just a cleanup,
no functional changes.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22645


# be2cfdbc 13-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_getsid() and use it in Linuxulator; no functional changes.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22647


# d6fee74a 12-Dec-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_sync(9), and make kernel code call it instead of going
via sys_sync(2). Minor cleanup, no functional changes.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19366


# 20f70576 25-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

Add a shm_open2 syscall to support upcoming memfd_create

shm_open2 allows a little more flexibility than the original shm_open.
shm_open2 doesn't enforce CLOEXEC on its callers, and it has a separate
shmflag argument that can be expanded later. Currently the only shmflag is
to allow file sealing on the returned fd.

shm_open and memfd_create will both be implemented in libc to use this new
syscall.

__FreeBSD_version is bumped to indicate the presence.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21393


# 0cd95859 25-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

[2/3] Add an initial seal argument to kern_shm_open()

Now that flags may be set on posixshm, add an argument to kern_shm_open()
for the initial seals. To maintain past behavior where callers of
shm_open(2) are guaranteed to not have any seals applied to the fd they're
given, apply F_SEAL_SEAL for existing callers of kern_shm_open. A special
flag could be opened later for shm_open(2) to indicate that sealing should
be allowed.

We currently restrict initial seals to F_SEAL_SEAL. We cannot error out if
F_SEAL_SEAL is re-applied, as this would easily break shm_open() twice to a
shmfd that already existed. A note's been added about the assumptions we've
made here as a hint towards anyone wanting to allow other seals to be
applied at creation.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D21392


# bbbbeca3 24-Jul-2019 Rick Macklem <rmacklem@FreeBSD.org>

Add kernel support for a Linux compatible copy_file_range(2) syscall.

This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also me able to use the VOP_COPY_FILE_RANGE() method.

vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.

Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.

Reviewed by: kib, asomers (plus comments by brooks, jilles)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584


# 5dc7e31a 02-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Control implicit PROT_MAX() using procctl(2) and the FreeBSD note
feature bit.

In particular, allocate the bit to opt-out the image from implicit
PROTMAX enablement. Provide procctl(2) verbs to set and query
implicit PROTMAX handling. The knobs mimic the same per-image flag
and per-process controls for ASLR.

Reviewed by: emaste, markj (previous version)
Discussed with: brooks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D20795


# 77555b84 10-Jun-2019 Doug Moore <dougm@FreeBSD.org>

Change the check for 'size' wrapping around to zero in kern_mmap to account
for both the lower and upper bound modifications. Change the error returned
to ENOMEM. Rename the parameter size to len and make size a local variable
that stores the value of len after it has been modified.

This addresses concerns expressed by Bruce Evans after r348843.

Reported by: brde@optusnet.com.au
Reviewed by: kib, markj (mentors)
MFC after: 3 days
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20592


# a1304030 06-Apr-2019 Mariusz Zaborski <oshogbo@FreeBSD.org>

Introduce funlinkat syscall that always us to check if we are removing
the file associated with the given file descriptor.

Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567


# 318f0d77 06-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Use declared types for caddr_t arguments.

Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852


# 12e69f96 02-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Add const to input-only char * arguments.

These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declerations of system calls.

Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17812


# 4f77f488 25-Oct-2018 Konstantin Belousov <kib@FreeBSD.org>

Implement O_BENEATH and AT_BENEATH.

Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory. Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547


# c542c43e 16-Aug-2018 Jamie Gritton <jamie@FreeBSD.org>

Revert r337922, except for some documention-only bits. This needs to wait
until user is changed to stop using jail(2).

Differential Revision: D14791


# 284001a2 16-Aug-2018 Jamie Gritton <jamie@FreeBSD.org>

Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES). These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do. The first use
is obsolete along with jail(2), but keep them for the second-read-only use.

Differential Revision: D14791


# e8a1ec3e 27-Jun-2018 Ed Maste <emaste@FreeBSD.org>

Split kern_break from sys_break and use it in linuxulator

Previously the linuxulator's linux_brk invoked the FreeBSD sys_break
syscall implementation directly. Instead, move the bulk of the existing
implementation to kern_break, and call that from both sys_break and
linux_brk.

This also addresses a minor bug in linux_brk in that we now return the
actual (rounded up) break address, rather than the requested value.

Reviewed by: brooks (earlier version)
Sponsored by: Turing Robotic Industries
Differential Revision: https://reviews.freebsd.org/D16019


# 34a77b97 27-Mar-2018 Brooks Davis <brooks@FreeBSD.org>

Move uio enums to sys/_uio.h.

Include _uio.h instead of uio.h in several headers to reduce header
polution.

Fix a few places that relied on header polution to get the uio.h header.

I have not moved struct uio as many more things that use it rely on
header polution to get other definitions from uio.h.

Reviewed by: cem, kib, markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14811


# b1288166 17-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Use long for the last argument to VOP_PATHCONF rather than a register_t.

pathconf(2) and fpathconf(2) both return a long. The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly. Instead, the system calls explicitly store the
returned long value in td_retval[0].

Requested by: bde
Reviewed by: kib
Sponsored by: Chelsio Communications


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# dd688800 19-Dec-2017 John Baldwin <jhb@FreeBSD.org>

Add a custom VOP_PATHCONF method for fdescfs.

The method handles NAME_MAX and LINK_MAX explicitly. For all other
pathconf variables, the method passes the request down to the underlying
file descriptor. This requires splitting a kern_fpathconf() syscallsubr
routine out of sys_fpathconf(). Also, to avoid lock order reversals with
vnode locks, the fdescfs vnode is unlocked around the call to
kern_fpathconf(), but with the usecount of the vnode bumped.

MFC after: 1 month
Sponsored by: Chelsio Communications


# c4e20cad 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# f19351aa 05-May-2017 Brooks Davis <brooks@FreeBSD.org>

Provide a freebsd32 implementation of sigqueue()

The previous misuse of sys_sigqueue() was sending random register or
stack garbage to 64-bit targets. The freebsd32 implementation preserves
the sival_int member of value when signaling a 64-bit process.

Document the mixed ABI implementation of union sigval and the
incompability of sival_ptr with pointer integrity schemes.

Reviewed by: kib, wblock
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10605


# a3b7d0fb 06-Apr-2017 Brooks Davis <brooks@FreeBSD.org>

Regen after r316594.


# 46dc8e9d 30-Mar-2017 Dmitry Chagin <dchagin@FreeBSD.org>

Add kern_mincore() helper for micore() syscall.

Suggested by: kib@
Reviewed by: kib@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D10143


# 3f8455b0 18-Mar-2017 Eric van Gyzen <vangyzen@FreeBSD.org>

Add clock_nanosleep()

Add a clock_nanosleep() syscall, as specified by POSIX.
Make nanosleep() a wrapper around it.

Attach the clock_nanosleep test from NetBSD. Adjust it for the
FreeBSD behavior of updating rmtp only when interrupted by a signal.
I believe this to be POSIX-compliant, since POSIX mentions the rmtp
parameter only in the paragraph about EINTR. This is also what
Linux does. (NetBSD updates rmtp unconditionally.)

Copy the whole nanosleep.2 man page from NetBSD because it is complete
and closely resembles the POSIX description. Edit, polish, and reword it
a bit, being sure to keep any relevant text from the FreeBSD page.

Reviewed by: kib, ngie, jilles
MFC after: 3 weeks
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D10020


# 3fdcf9ef 13-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Order alphabetically.

Noted by: alc
MFC after: 3 days


# 496ab053 13-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Rework r313352.

Rename kern_vm_* functions to kern_*. Move the prototypes to
syscallsubr.h. Also change Mach VM types to uintptr_t/size_t as
needed, to avoid headers pollution.

Requested by: alc, jhb
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D9535


# 96ee4310 05-Feb-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_cpuset_getaffinity() and kern_cpuset_getaffinity(),
and use it in compats instead of their sys_*() counterparts.

Reviewed by: kib, jhb, dchagin
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9383


# b38b22b0 31-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_pread() and kern_pwrite(), and use it in compats instead
of their sys_*() counterparts. The svr4 is left unchanged.

Reviewed by: kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9379


# ea2ebdc1 31-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_cpuset_getid() and kern_cpuset_setid(), and use them
in compat32 instead of their sub_*() counterparts.

Reviewed by: jhb@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9382


# d293f35c 29-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_listen(), kern_shutdown(), and kern_socket(), and use them
instead of their sys_*() counterparts in various compats. The svr4
is left untouched, because there's no point.

Reviewed by: ed@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9367


# f67d6b5f 29-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_lseek() and use it instead of sys_lseek() in various compats.
I didn't touch svr4/, there's no point.

Reviewed by: ed@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9366


# 142c750a 28-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused kern_sendfile() declaration.


# 34ed0c63 27-Dec-2016 John Baldwin <jhb@FreeBSD.org>

Rename the 'flags' argument to getfsstat() to 'mode' and validate it.

This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'flag'. Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.

This is a followup to r308088.

Reviewed by: kib
MFC after: 1 month


# 93d9ebd8 15-Aug-2016 Ed Schouten <ed@FreeBSD.org>

Eliminate use of sys_fsync() and sys_fdatasync().

Make the kern_fsync() function public, so that it can be used by other
parts of the kernel. Fix up existing consumers to make use of it.

Requested by: kib


# 0acf5d0b 25-Feb-2016 Mark Johnston <markj@FreeBSD.org>

Improve error handling for posix_fallocate(2) and posix_fadvise(2).

- Set td_errno so that ktrace and dtrace can obtain the syscall error
number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
not intended to be returned to userland.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425


# e26f6b5f 11-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Add support for anonymous kqueues.

CloudABI's polling system calls merge the concept of one-shot polling
(poll, select) and stateful polling (kqueue). They share the same data
structures.

Extend FreeBSD's kqueue to provide support for waiting for events on an
anonymous kqueue. Unlike stateful polling, there is no need to support
timeouts, as an additional timer event could be used instead.
Furthermore, it makes no sense to use a different number of input and
output kevents. Merge this into a single argument.

Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3307


# aa04a06d 11-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Introduce kern_cap_rights_limit().

The existing sys_cap_rights_limit() expects that a cap_rights_t object
lives in userspace. It is therefore hard to call into it from
kernelspace.

Move the interesting bits of sys_cap_rights_limit() into
kern_cap_rights_limit(), so that we can call into it from the CloudABI
compatibility layer.

Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3314


# a2034cc9 05-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Allow the creation of kqueues with a restricted set of Capsicum rights.

On CloudABI we want to create file descriptors with just the minimal set
of Capsicum rights in place. The reason for this is that it makes it
easier to obtain uniform behaviour across different operating systems.

By explicitly whitelisting the operations, we can return consistent
error codes, but also prevent applications from depending OS-specific
behaviour.

Extend kern_kqueue() to take an additional struct filecaps that is
passed on to falloc_caps(). Update the existing consumers to pass in
NULL.

Differential Revision: https://reviews.freebsd.org/D3259


# 7ee1b208 01-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Add kern_shm_open().

This allows you to specify the capabilities that the new file descriptor
should have. This allows us to create shared memory objects that only
have the rights we're interested in.

The idea behind restricting the rights is that it makes it a lot easier
for CloudABI to get consistent behaviour across different operating
systems. We only need to make sure that a shared memory implementation
consistently implements the operations that are whitelisted.

Approved by: kib
Obtained from: https://github.com/NuxiNL/freebsd


# 8328babd 29-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Make pipes in CloudABI work.

Summary:
Pipes in CloudABI are unidirectional. The reason for this is that
CloudABI attempts to provide a uniform runtime environment across
different flavours of UNIX.

Instead of implementing a custom pipe that is unidirectional, we can
simply reuse Capsicum permission bits to support this. This is nice,
because CloudABI already attempts to restrict permission bits to
correspond with the operations that apply to a certain file descriptor.

Replace kern_pipe() and kern_pipe2() by a single kern_pipe() that takes
a pair of filecaps. These filecaps are passed to the newly introduced
falloc_caps() function that creates the descriptors with rights in
place.

Test Plan:
CloudABI pipes seem to be created with proper rights in place:

https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/unistd/pipe_test.c#L44

Reviewers: jilles, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3236


# ea566832 10-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Add missing const keyword to kern_sigaction()'s 'act' parameter.

This structure is not modified by the function. Also add const to
sigact_flag_test(), as it is called by kern_sigaction().


# 5fe97c20 10-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

fd: split kern_dup flags argument into actual flags and a mode

Tidy up the code inside to switch on the mode.


# 2491302a 09-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Add implementations for some of the CloudABI file descriptor system calls.

All of the CloudABI system calls that operate on file descriptors of an
arbitrary type are prefixed with fd_. This change adds wrappers for
most of these system calls around their FreeBSD equivalents.

The dup2() system call present on CloudABI deviates from POSIX, in the
sense that it can only be used to replace existing file descriptor. It
cannot be used to create new ones. The reason for this is that this is
inherently thread-unsafe. Furthermore, there is no need on CloudABI to
use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no
special meaning.

This change exposes the kern_dup() through <sys/syscallsubr.h> and puts
the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag,
FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not
allocated.

Differential Revision: https://reviews.freebsd.org/D3035
Reviewed by: mjg


# 7236f2c2 24-May-2015 Dmitry Chagin <dchagin@FreeBSD.org>

For future use in the Linuxulator:

1. Add a kern_kqueue() counterpart for kqueue() with flags parameter.

2. Be a bit secure. To avoid a double fp lookup add a kern_kevent_fp()
counterpart for kern_kevent() with file pointer parameter instead
of file descriptor an pass the buck to it.

Suggested by: mjg [2]

Differential Revision: https://reviews.freebsd.org/D1091
Reviewed by: trasz


# a93e83c8 24-May-2015 Dmitry Chagin <dchagin@FreeBSD.org>

In preparation for switching linuxulator to the use the native 1:1
threads split sys_sched_getparam(), sys_sched_setparam(),
sys_sched_getscheduler(), sys_sched_setscheduler() to their kern_*
counterparts and add targettd parameter to allow specify the target
thread directly by callee.

Differential Revision: https://reviews.freebsd.org/D1034
Reviewed by: trasz


# 1aa90eca 24-May-2015 Dmitry Chagin <dchagin@FreeBSD.org>

In preparation for switching linuxulator to the use the native 1:1
threads refactor kern_sched_rr_get_interval() and sys_sched_rr_get_interval().
Add a kern_sched_rr_get_interval() counterpart which takes a targettd
parameter to allow specify target thread directly by callee (new Linuxulator).

Linuxulator temporarily uses first thread in proc.

Move linux_sched_rr_get_interval() to the MI part.

Differential Revision: https://reviews.freebsd.org/D1032
Reviewed by: trasz


# 09baafb4 24-May-2015 Dmitry Chagin <dchagin@FreeBSD.org>

In preparation for switching linuxulator to the use the native 1:1
threads introduce kern_thr_alloc() which will be used later in the
linux_clone().

Differential Revision: https://reviews.freebsd.org/D1029
Reviewed by: trasz


# 95be6d2b 24-May-2015 Dmitry Chagin <dchagin@FreeBSD.org>

In preparation for switching linuxulator to the use the native 1:1
threads split sys_thr_exit() up into sys_thr_exit() and kern_thr_exit().
Move
Where the second will be used in linux_exit() system call later.

Differential Revision: https://reviews.freebsd.org/D1028
Reviewed by: trasz


# 6289b482 21-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Modify kern___getcwd() to take max pathlen limit as an additional
argument. This will be used for the Linux emulation layer - for Linux,
PATH_MAX is 4096 and not 1024.

Differential Revision: https://reviews.freebsd.org/D2335
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 1c73bcab 15-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Rewrite linprocfs_domtab() as a wrapper around kern_getfsstat(). This
adds missing jail and MAC checks.

Differential Revision: https://reviews.freebsd.org/D2193
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 2205e0d1 23-Jan-2015 Jilles Tjoelker <jilles@FreeBSD.org>

Add futimens and utimensat system calls.

The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.

A new UTIME_* constant might allow setting birthtimes in future.

Differential Revision: https://reviews.freebsd.org/D1426
Submitted by: pluknet (partially)
Reviewed by: delphij, pluknet, rwatson
Relnotes: yes


# 9f7a06f2 04-Jan-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Indeed, instead of hiding the kern___getcwd() bug by bogus cast
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.

These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.

While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.

Pointed out by: bde

MFC after: 1 week


# 6e646651 13-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 186d9c34 12-Nov-2014 Dmitry Chagin <dchagin@FreeBSD.org>

Add the ppoll() system call.
Export kern_poll() needed by an upcoming Linuxulator change.

Differential Revision: https://reviews.freebsd.org/D1133
Reviewed by: kib, wblock
MFC after: 1 month


# 07b384cb 21-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Eliminate unnecessary memory allocation in sys_getgroups and its ibcs2 counterpart.


# f69261f2 25-Sep-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix fcntl(2) compat32 after r270691. The copyin and copyout of the
struct flock are done in the sys_fcntl(), which mean that compat32 used
direct access to userland pointers.

Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which
performs neccessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.

Reported by: jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# abd386ba 24-Aug-2014 Mateusz Guzik <mjg@FreeBSD.org>

Fix getppid for traced processes.

Traced processes always have the tracer set as the parent.
Utilize proc_realparent to obtain the right process when needed.

Reviewed by: kib
MFC after: 1 week


# e346f8c4 07-Aug-2014 Bjoern A. Zeeb <bz@FreeBSD.org>

Split up sys_ktimer_getoverrun() into a sys_ and a kern_ variant
and export the kern_ version needed by an upcoming linuxolator change.

MFC after: 3 days
Sponsored by: DARPA,AFRL


# 55648840 19-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# b12698e1 18-Sep-2013 Roman Divacky <rdivacky@FreeBSD.org>

Revert r255672, it has some serious flaws, leaking file references etc.

Approved by: re (delphij)


# 253c75c0 18-Sep-2013 Roman Divacky <rdivacky@FreeBSD.org>

Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by: re (delphij)
Sponsored by: Google Summer of Code
Submitted by: Yuri Victorovich <yuri at rawbw dot com>
Tested by: Yuri Victorovich <yuri at rawbw dot com>


# 0dac22d8 18-Aug-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Implement 32bit versions of the cap_ioctls_limit(2) and cap_ioctls_get(2)
system calls as unsigned longs have different size on i386 and amd64.

Reported by: jilles
Sponsored by: The FreeBSD Foundation


# 643ee871 21-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Implement compat32 wrappers for the ktimer_* syscalls.

Reported, reviewed and tested by: Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# d31e4b3a 20-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

id_t is 64bit, provide the compat32 wrapper for clock_getcpuclockid2(2).

Reported and tested by: Petr Salinger <Petr.Salinger@seznam.cz>
PR: threads/180652
Sponsored by: The FreeBSD Foundation


# da7d2afb 01-May-2013 Jilles Tjoelker <jilles@FreeBSD.org>

Add accept4() system call.

The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.


# d289dc7b 31-Mar-2013 Jilles Tjoelker <jilles@FreeBSD.org>

Rename do_pipe() to kern_pipe2() and declare it properly.


# e2d55f48 15-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Move the definition of the idtype_t from sys/types.h to sys/wait.h.
Fix the bug, use #if __BSD_VISIBLE instead of #if defined(__BSD_VISIBLE),
since __BSD_VISIBLE is always defined.
Reformat the comments from the Solaris style to KNF.

Reported and reviewed by: bde
MFC after: 28 days


# 43bdcf93 15-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Alphabetically reorder the forward-declarations of the structures.
Add the declaration for enum idtype, to be used later.

Reported and reviewed by: bde
MFC after: 28 days


# f13b5a0f 12-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month


# 76dcec5d 24-May-2012 Gleb Kurtsou <gleb@FreeBSD.org>

Add kern_fhstat(), adjust sys_fhstat() to use it.

Extend kern_getdirentries() to accept uio segflag and optionally return
buffer residue.

Sponsored by: Google Summer of Code 2011


# 7edec621 14-Nov-2011 John Baldwin <jhb@FreeBSD.org>

- Split out a kern_posix_fadvise() from the posix_fadvise() system call so
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
freebsd32 wrappers for the two system calls.


# 7332c129 01-Apr-2011 Konstantin Belousov <kib@FreeBSD.org>

Add support for executing the FreeBSD 1/i386 a.out binaries on amd64.

In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
on amd64.

The changes are hidden under COMPAT_43.

MFC after: 1 month


# 86665509 30-Mar-2011 Konstantin Belousov <kib@FreeBSD.org>

Provide compat32 shims for kldstat(2).

Requested and tested by: jpaetzel
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# fc0de8f0 30-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Move prototypes for kern_sigtimedwait() and kern_sigprocmask() to
<sys/syscallsubr.h> where all other kern_<syscall> prototypes live.


# e268f54c 11-Jan-2010 Kirk McKusick <mckusick@FreeBSD.org>

Background:

When renaming a directory it passes through several intermediate
states. First its new name will be created causing it to have two
names (from possibly different parents). Next, if it has different
parents, its value of ".." will be changed from pointing to the old
parent to pointing to the new parent. Concurrently, its old name
will be removed bringing it back into a consistent state. When fsck
encounters an extra name for a directory, it offers to remove the
"extraneous hard link"; when it finds that the names have been
changed but the update to ".." has not happened, it offers to rewrite
".." to point at the correct parent. Both of these changes were
considered unexpected so would cause fsck in preen mode or fsck in
background mode to fail with the need to run fsck manually to fix
these problems. Fsck running in preen mode or background mode now
corrects these expected inconsistencies that arise during directory
rename. The functionality added with this update is used by fsck
running in background mode to make these fixes.

Solution:

This update adds three new fsck sysctl commands to support background
fsck in correcting expected inconsistencies that arise from incomplete
directory rename operations. They are:

setcwd(dirinode) - set the current directory to dirinode in the
filesystem associated with the snapshot.
setdotdot(oldvalue, newvalue) - Verify that the inode number for ".."
in the current directory is oldvalue then change it to newvalue.
unlink(nameptr, oldvalue) - Verify that the inode number associated
with nameptr in the current directory is oldvalue then unlink it.

As with all other fsck sysctls, these new ones may only be used by
processes with appropriate priviledge.

Reported by: jeff
Security issues: rwatson


# 7e767511 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198508, r198509:
Reimplement pselect() in kernel, making change of sigmask and sleep atomic.

MFC r198538:
Move pselect(3) man page to section 2.


# 3134e115 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198506:
In kern_sigsuspend(), manipulate thread signal mask using
kern_sigprocmask(). Also, do cursig/postsig loop immediately after
waiting for signal, repeating the wait if wakeup was spurious due to
race with other thread fetching signal from the process queue before us.

MFC r199136:
Use cpu_set_syscall_retval(9) to set syscall result, and return
EJUSTRETURN from kern_sigsuspend() to prevent syscall return code from
modifying wrong frame.
Take care of possibility that pending SIGCONT might be cancelled by
SIGSTOP, causing postsig() not to deliver any catched signal.


# 066d836b 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Current pselect(3) is implemented in usermode and thus vulnerable to
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 84440afb 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

In kern_sigsuspend(), better manipulate thread signal mask using
kern_sigprocmask() to properly notify other possible candidate threads
for signal delivery.

Since sigsuspend() shall only return to usermode after a signal was
delivered, do cursig/postsig loop immediately after waiting for
signal, repeating the wait if wakeup was spurious due to race with
other thread fetching signal from the process queue before us. Add
thread_suspend_check() call to allow the thread to be stopped or killed
while in loop.

Modify last argument of kern_sigprocmask() from boolean to flags,
allowing the function to be called with locked proc. Convertion of the
callers that supplied 1 to the old argument will be done in the next
commit, and due to SIGPROCMASK_OLD value equial to 1, code is formally
correct in between.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 9f1fab50 16-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197049:
Calculate the amount of bytes to copy for select filedescriptor masks
taking into account size of fd_set for the current process ABI.

Approved by: re (kensmith)


# b55ef216 09-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

kern_select(9) copies fd_set in and out of userspace in quantities of
longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in
or out 4 bytes more then the process expected.

Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.

Diagnosed and tested by: Peter Jeremy <peterjeremy acm org>
Reviewed by: jhb
MFC after: 1 week


# c3889811 08-Jul-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

There is an optimization in chmod(1), that makes it not to call chmod(2)
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.

This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.

Reviewed by: rwatson (earlier version)
Approved by: re (kib)


# 4202e1be 30-May-2009 Dmitry Chagin <dchagin@FreeBSD.org>

Split native socketpair() syscall onto kern_socketpair() which should
be used by kernel consumers and socketpair() itself.

Approved by: kib (mentor)
MFC after: 1 month


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# b38ff370 29-Apr-2009 Jamie Gritton <jamie@FreeBSD.org>

Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by: bz (mentor)


# 0eee862a 20-Feb-2009 Ed Schouten <ed@FreeBSD.org>

Don't make Linux stat() open character devices to resolve its name.

The existing code calls kern_open() to resolve the vnode of a pathname
right after a stat(). This is not correct, because it causes random
character devices to be opened in /dev. This means ls'ing a tape
streamer will cause it to rewind, for example. Changes I have made:

- Add kern_statat_vnhook() to allow binary emulators to `post-process'
struct stat, using the proper vnode.

- Remove unneeded printf's from stat() and statfs().

- Make the Linuxolator use kern_statat_vnhook(), replacing
translate_path_major_minor_at().

- Let translate_fd_major_minor() use vp->v_rdev instead of
vp->v_un.vu_cdev.

Result:

crw-rw-rw- 1 root root 0, 14 Feb 20 13:54 /dev/ptmx
crw--w---- 1 root adm 136, 0 Feb 20 14:03 /dev/pts/0
crw--w---- 1 root adm 136, 1 Feb 20 14:02 /dev/pts/1
crw--w---- 1 ed tty 136, 2 Feb 20 14:03 /dev/pts/2

Before this commit, ptmx also had a major number of 136, because it
silently allocated and deallocated a pseudo-terminal. Device nodes that
cannot be opened now have proper major/minor-numbers.

Reviewed by: kib, netchild, rdivacky (thanks!)


# ab0d10f6 11-Nov-2008 Ed Schouten <ed@FreeBSD.org>

Several cleanups related to pipe(2).

- Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2)
fills an array with two descriptors.

- Remove EFAULT from the manual page. Because of the current calling
convention, pipe(2) raises a segmentation fault when an invalid
address is passed.

- Introduce kern_pipe() to make it easier for binary emulations to
implement pipe(2).

- Make Linux binary emulation use kern_pipe(), which means we don't have
to recover td_retval after calling the FreeBSD system call.

Approved by: rdivacky
Discussed on: arch


# 63f8fe9e 22-Oct-2008 John Baldwin <jhb@FreeBSD.org>

Split the copyout of *base at the end of getdirentries() out leaving the
rest in kern_getdirentries(). Use kern_getdirentries() to implement
freebsd32_getdirentries(). This fixes a bug where calls to getdirentries()
in 32-bit binaries would trash the 4 bytes after the 'long base' in
userland.

Submitted by: ups
MFC after: 1 week


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 48b05c3f 08-Apr-2008 Konstantin Belousov <kib@FreeBSD.org>

Implement the linux syscalls
openat, mkdirat, mknodat, fchownat, futimesat, fstatat, unlinkat,
renameat, linkat, symlinkat, readlinkat, fchmodat, faccessat.

Submitted by: rdivacky
Sponsored by: Google Summer of Code 2007
Tested by: pho


# e4193f25 30-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Implement the
openat(2), faccessat(2), fchmodat(2), fchownat(2), fstatat(2),
futimesat(2), linkat(2), mkdirat(2), mkfifoat(2), mknodat(2),
readlinkat(2), renameat(2), symlinkat(2)
syscalls.

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


# 5f56182b 12-Feb-2008 Ruslan Ermilov <ru@FreeBSD.org>

Change readlink(2)'s return type and type of the last argument
to match POSIX.

Prodded by: Alexey Lyashkov


# e4650294 07-Jan-2008 John Baldwin <jhb@FreeBSD.org>

Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by: rwatson


# a66fde8d 07-Jun-2007 John Baldwin <jhb@FreeBSD.org>

- Remove unused variable from create_thread().
- Move kern_thr_*() prototype to <sys/syscallsubr.h> where all the other
kern_*() prototypes live.


# 4e4de5e4 20-Dec-2006 Jung-uk Kim <jkim@FreeBSD.org>

MFP4: (part of) 110058

copyin()/copyout() for message type is separated from msgsnd()/msgrcv() and
it is done from its wrapper functions to support 32-bit emulations. After I
implemented this, I have briefly referenced NetBSD and Darwin. NetBSD passes
copyin()/copyout() function pointers from wrappers. Darwin passes size of
message type as an argument, which is actually similar to my first
implementation (P4 109706). We may revisit these implementations later.


# f30e89ce 27-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Fix a file descriptor race I reintroduced when I split accept1() up into
kern_accept() and accept1(). If another thread closed the new file
descriptor and the first thread later got an error trying to copyout the
socket address, then it would attempt to close the wrong file object. To
fix, add a struct file ** argument to kern_accept(). If it is non-NULL,
then on success kern_accept() will store a pointer to the new file object
there and not release any of the references. It is up to the calling code
to drop the references appropriately (including a call to fdclose() in case
of error to safely handle the aforementioned race). While I'm at it, go
ahead and fix the svr4 streams code to not leak the accept fd if it gets an
error trying to copyout the streams structures.


# c870740e 10-Jul-2006 John Baldwin <jhb@FreeBSD.org>

- Split out kern_accept(), kern_getpeername(), and kern_getsockname() for
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat. Specifically, go ahead
and hard code UIO_USERSPACE in the uio as that's what all the callers
specify. In place, add a new uioseg to indicate what type of pointer
is in mp->msg_name. Previously it was always a userland address, but
ABI emulators may pass in kernel-side sockaddrs. Also, remove the
namelenp field and instead require the two places that used it to
explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.


# d9f46233 08-Jul-2006 John Baldwin <jhb@FreeBSD.org>

- Split ioctl() up into ioctl() and kern_ioctl(). The kern_ioctl() assumes
that the 'data' pointer is already setup to point to a valid KVM buffer
or contains the copied-in data from userland as appropriate (ioctl(2)
still does this). kern_ioctl() takes care of looking up a file pointer,
implementing FIONCLEX and FIOCLEX, and calling fi_ioctl().
- Use kern_ioctl() to implement xenix_rdchk() instead of using the stackgap
and mark xenix_rdchk() MPSAFE.


# c1cccebe 08-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Add a kern_close() so that the ABIs can close a file descriptor w/o having
to populate a close_args struct and change some of the places that do.


# b1ee5b65 08-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Rework kern_semctl a bit to always assume the UIO_SYSSPACE case. This
mostly consists of pushing a few copyin's and copyout's up into
__semctl() as all the other callers were already doing the UIO_SYSSPACE
case. This also changes kern_semctl() to set the return value in a passed
in pointer to a register_t rather than td->td_retval[0] directly so that
callers can only set td->td_retval[0] if all the various copyout's succeed.

As a result of these changes, kern_semctl() no longer does copyin/copyout
(except for GETALL/SETALL) so simplify the locking to acquire the semakptr
mutex before the MAC check and hold it all the way until the end of the
big switch statement. The GETALL/SETALL cases have to temporarily drop it
while they do copyin/malloc and copyout. Also, simplify the SETALL case to
remove handling for a non-existent race condition.


# 3cb83e71 06-Jul-2006 John Baldwin <jhb@FreeBSD.org>

Add kern_setgroups() and kern_getgroups() and use them to implement
ibcs2_[gs]etgroups() rather than using the stackgap. This also makes
ibcs2_[gs]etgroups() MPSAFE. Also, it cleans up one bit of weirdness in
the old setgroups() where it allocated an entire credential just so it had
a place to copy the group list into. Now setgroups just allocates a
NGROUPS_MAX array on the stack that it copies into and then passes to
kern_setgroups().


# 49d409a1 27-Jun-2006 John Baldwin <jhb@FreeBSD.org>

- Add a kern_semctl() helper function for __semctl(). It accepts a pointer
to a copied-in copy of the 'union semun' and a uioseg to indicate which
memory space the 'buf' pointer of the union points to. This is then used
in linux_semctl() and svr4_sys_semctl() to eliminate use of the stackgap.
- Mark linux_ipc() and svr4_sys_semsys() MPSAFE.


# d5388587 13-Jun-2006 John Baldwin <jhb@FreeBSD.org>

- Add a kern_kldload() that is most of the previous kldload() and push
Giant down in it.
- Push Giant down in kern_kldunload() and reorganize it slightly to avoid
using gotos. Also, expose this function to the rest of the kernel.


# fa545f43 28-Feb-2006 Paul Saab <ps@FreeBSD.org>

Fix 32bit sendfile by implementing kern_sendfile so that it takes
the header and trailers as iovec arguments instead of copying them
in inside of sendfile.

Reviewed by: jhb
MFC after: 3 weeks


# 809f984b 06-Feb-2006 John Baldwin <jhb@FreeBSD.org>

Add a kern_eaccess() function and use it to implement xenix_eaccess()
rather than kern_access().

Suggested by: rwatson


# ecc44de7 31-Oct-2005 Paul Saab <ps@FreeBSD.org>

Reformat socket control messages on input/output for 32bit compatibility
on 64bit systems.

Submitted by: ps, ups
Reviewed by: jhb


# a372f822 14-Oct-2005 Paul Saab <ps@FreeBSD.org>

Implement the 32bit versions of recvmsg, recvfrom, sendmsg

Partially obtained from: jhb


# f0b479cd 14-Oct-2005 Paul Saab <ps@FreeBSD.org>

Implement 32bit wrappers for clock_gettime, clock_settime, and
clock_getres.


# bcd9e0dd 07-Jul-2005 John Baldwin <jhb@FreeBSD.org>

- Add two new system calls: preadv() and pwritev() which are like readv()
and writev() except that they take an additional offset argument and do
not change the current file position. In SAT speak:
preadv:readv::pread:read and pwritev:writev::pwrite:write.
- Try to reduce code duplication some by merging most of the old
kern_foov() and dofilefoo() functions into new dofilefoo() functions
that are called by kern_foov() and kern_pfoov(). The non-v functions
now all generate a simple uio on the stack from the passed in arguments
and then call kern_foov(). For example, read() now just builds a uio and
calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev().

PR: kern/80362
Submitted by: Marc Olzheim marcolz at stack dot nl (1)
Approved by: re (scottl)
MFC after: 1 week


# 3a996d6e 11-Jun-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Do not allocate memory based on not-checked argument from userland.
It can be used to panic the kernel by giving too big value.
Fix it by moving allocation and size verification into kern_getfsstat().
This even simplifies kern_getfsstat() consumers, but destroys symmetry -
memory is allocated inside kern_getfsstat(), but has to be freed by the
caller.

Found by: FreeBSD Kernel Stress Test Suite: http://www.holm.cc/stress/
Reported by: Peter Holm <peter@holm.cc>


# 13a82b96 09-Jun-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Avoid code duplication in serval places by introducing universal
kern_getfsstat() function.

Obtained from: jhb


# efe5beca 03-Jun-2005 Paul Saab <ps@FreeBSD.org>

Wrap copyin/copyout for kevent so the 32bit wrapper does not have
to malloc nchanges * sizeof(struct kevent) AND/OR nevents *
sizeof(struct kevent) on every syscall.

Glanced at by: peter, jmg
Obtained from: Yahoo!
MFC after: 2 weeks


# b88ec951 31-Mar-2005 John Baldwin <jhb@FreeBSD.org>

Implement kern_adjtime(), kern_readv(), kern_sched_rr_get_interval(),
kern_settimeofday(), and kern_writev() to allow for further stackgap
reduction in the compat ABIs.


# c1aa81b6 01-Mar-2005 Paul Saab <ps@FreeBSD.org>

regen


# 1a88a252 13-Feb-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Backout previous change (disabling of security checks for signals delivered
in emulation layers), since it appears to be too broad.

Requested by: rwatson


# d8ff44b7 13-Feb-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Split out kill(2) syscall service routine into user-level and kernel part, the
former is callable from user space and the latter from the kernel one. Make
kernel version take additional argument which tells if the respective call
should check for additional restrictions for sending signals to suid/sugid
applications or not.

Make all emulation layers using non-checked version, since signal numbers in
emulation layers can have different meaning that in native mode and such
protection can cause misbehaviour.

As a result remove LIBTHR from the signals allowed to be delivered to a
suid/sugid application.

Requested (sorta) by: rwatson
MFC after: 2 weeks


# fee4a6af 07-Feb-2005 John Baldwin <jhb@FreeBSD.org>

Implement a kern_pathconf() wrapper for pathconf() which can take the
filename from either a user space or a kernel space pointer.


# 76951d21 07-Feb-2005 John Baldwin <jhb@FreeBSD.org>

- Tweak kern_msgctl() to return a copy of the requested message queue id
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks and doing non-lock-safe things like
copyout() with the data anyways. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatability system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.


# a6886ef1 30-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Extend kern_sendit() to take another enum uio_seg argument, which specifies
where the buffer to send lies and use it to eliminate yet another stackgap
in linuxlator.

MFC after: 2 weeks


# 610ecfe0 29-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

o Split out kernel part of execve(2) syscall into two parts: one that
copies arguments into the kernel space and one that operates
completely in the kernel space;

o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.

Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks


# f4b6eb04 25-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Split out kernel side of msgctl(2) into two parts: the first that pops data
from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrapper
of that syscall.

MFC after: 2 weeks


# 23af91dc 25-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

More kern_{get,set}itiver() where they belong.

Submitted by: dwmalone
MFC after: 2 weeks


# efa42cbc 19-Jan-2005 Paul Saab <ps@FreeBSD.org>

move kern_nanosleep to sys/syscallsubr.h

Requested by: jhb


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# c8837938 05-Jan-2005 John Baldwin <jhb@FreeBSD.org>

- Move the function prototypes for kern_setrlimit() and kern_wait() to
sys/syscallsubr.h where all the other kern_foo() prototypes live.
- Resort kern_execve() while I'm there.


# 84e0b075 07-Oct-2004 David Xu <davidxu@FreeBSD.org>

Add an execve command for kse_thr_interrupt to allow libpthread to
restore signal mask correctly, this is required by POSIX.

Reviewed by: deischen


# 78c85e8d 05-Oct-2004 John Baldwin <jhb@FreeBSD.org>

Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.

Submitted by: bde (mostly)
MFC after: 1 month


# 8914e6f4 24-Sep-2004 John Baldwin <jhb@FreeBSD.org>

Sort forward declared structures.


# e140eb43 17-Jul-2004 David Malone <dwmalone@FreeBSD.org>

Add a kern_setsockopt and kern_getsockopt which can read the option
values from either user land or from the kernel. Use them for
[gs]etsockopt and to clean up some calls to [gs]etsockopt in the
Linux emulation code that uses the stackgap.


# 2332251c 04-Nov-2003 Max Khon <fjoe@FreeBSD.org>

Back out the following revisions:

1.36 +73 -60 src/sys/compat/linux/linux_ipc.c
1.83 +102 -48 src/sys/kern/sysv_shm.c
1.8 +4 -0 src/sys/sys/syscallsubr.h

That change was intended to support vmware3, but
wantrem parameter is useless because vmware3 uses SYSV shared memory
to talk with X server and X server is native application.
The patch worked because check for wantrem was not valid
(wantrem and SHMSEG_REMOVED was never checked for SHMSEG_ALLOCATED segments).

Add kern.ipc.shm_allow_removed (integer, rw) sysctl (default 0) which when set
to 1 allows to return removed segments in
shm_find_segment_by_shmid() and shm_find_segment_by_shmidx().

MFC after: 1 week


# 710c5645 05-May-2003 David Malone <dwmalone@FreeBSD.org>

Split sendit into two parts. The first part, still called sendit, that
does the copyin stuff and then calls the second part kern_sendit to do
the hard work. Don't bother holding Giant during the copyin phase.

The intent of this is to allow the Linux emulator to impliment send*
syscalls without using the stackgap.


# f130dcf2 05-May-2003 Martin Blapp <mbr@FreeBSD.org>

Change the semantics of sysv shm emulation to take a additional
argument to the functions shm{at,ctl}1 and shm_find_segment_by_shmid{x}.
The BSD semantics didn't allow the usage of shared segment after
being marked for removal through IPC_RMID.

The patch involves the following functions:
- shmat
- shmctl
- shm_find_segment_by_shmid
- shm_find_segment_by_shmidx
- linux_shmat
- linux_shmctl

Submitted by: Orlando Bassotto <orlando.bassotto@ieo-research.it>
Reviewed by: marcel


# e77daab1 18-Apr-2003 John Baldwin <jhb@FreeBSD.org>

Rename do_sigprocmask() to kern_sigprocmask() and make it a global symbol
so that it can be used by binary emulators.


# 12e4397e 03-Feb-2003 Hajimu UMEMOTO <ume@FreeBSD.org>

Break out the bind and connect syscalls to intend to make calling
these syscalls internally easy.
This is preparation for force coming IPv6 support for Linuxlator.

Submitted by: dwmalone
MFC after: 10 days


# 23eeeff7 25-Oct-2002 Peter Wemm <peter@FreeBSD.org>

Split 4.x and 5.x signal handling so that we can keep 4.x signal
handling clean and functional as 5.x evolves. This allows some of the
nasty bandaids in the 5.x codepaths to be unwound.

Encapsulate 4.x signal handling under COMPAT_FREEBSD4 (there is an
anti-foot-shooting measure in place, 5.x folks need this for a while) and
finish encapsulating the older stuff under COMPAT_43. Since the ancient
stuff is required on alpha (longjmp(3) passes a 'struct osigcontext *'
to the current sigreturn(2), instead of the 'ucontext_t *' that sigreturn
is supposed to take), add a compile time check to prevent foot shooting
there too. Add uniform COMPAT_43 stubs for ia64/sparc64/powerpc.

Tested on: i386, alpha, ia64. Compiled on sparc64 (a few days ago).
Approved by: re


# 012e544f 04-Sep-2002 Ian Dowse <iedowse@FreeBSD.org>

Split up ptrace() into a wrapper that does the copying to and from
user space and a kern_ptrace() implementation. Use the kern_*()
version in the Linux emulation code to remove more stack gap uses.

Approved by: des


# 48b52b7a 02-Sep-2002 Ian Dowse <iedowse@FreeBSD.org>

Split up __getcwd so that kernel callers of the internal version
can specify whether the buffer is in user or system space.


# 49c2ff15 02-Sep-2002 Ian Dowse <iedowse@FreeBSD.org>

Split fcntl() into a wrapper and a kernel-callable kern_fcntl()
implementation. The wrapper is responsible for copying additional
structure arguments (struct flock) to and from userland.


# 8f19eb88 01-Sep-2002 Ian Dowse <iedowse@FreeBSD.org>

Split out a number of mostly VFS and signal related syscalls into
a kernel-internal kern_*() version and a wrapper that is called via
the syscall vector table. For paths and structure pointers, the
internal version either takes a uio_seg parameter or requires the
caller to copyin() the data to kernel memory as appropiate. This
will permit emulation layers to use these syscalls without having
to copy out translated arguments to the stack gap.

Discussed on: -arch
Review/suggestions: bde, jhb, peter, marcel