History log of /freebsd-current/sys/kern/kern_exec.c
Revision Date Author Comments
# 0cd9cde7 06-Apr-2024 Jake Freeland <jfree@FreeBSD.org>

ktrace: Record namei violations with KTR_CAPFAIL

Report namei path lookups while Capsicum violation tracing with
CAPFAIL_NAMEI. vfs caching is also ignored when tracing to mimic
capability mode behavior.

Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40680


# 4a69fc16 07-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Add membarrier(2)

This is an attempt at clean-room implementation of the Linux'
membarrier(2) syscall. For documentation, you would need to read
both membarrier(2) Linux man page, the comments in Linux
kernel/sched/membarrier.c implementation and possibly look at
actual uses.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32360


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 94426d21 30-May-2023 Jessica Clarke <jrtc27@FreeBSD.org>

pmc: Rework PROCEXEC event to support PIEs

Currently the PROCEXEC event only reports a single address, entryaddr,
which is the entry point of the interpreter in the typical dynamic case,
and used solely to calculate the base address of the interpreter. For
PDEs this is fine, since the base address is known from the program
headers, but for PIEs the base address varies at run time based on where
the kernel chooses to load it, and so pmcstat has no way of knowing the
real address ranges for the executable. This was less of an issue in the
past since PIEs were rare, but now they're on by default on 64-bit
architectures it's more of a problem.

To solve this, pass through what was picked for et_dyn_addr by the
kernel, and use that as the offset for the executable's start address
just as is done for everything in the kernel. Since we're changing this
interface, sanitise the way we determine the interpreter's base address
by passing it through directly rather than indirectly via the entry
point and having to subtract off whatever the ELF header's e_entry is
(and anything that wants the entry point in future can still add that
back on as needed; this merely changes the interface to directly provide
the underlying variables involved).

This will be followed up by a bump to the pmc major version.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D39595


# d706d02e 29-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

sysentvec: Retire sv_imgact_try as unneeded anymore

The sysentvec sv_imgact_try was used by kern_exec() to allow
non-native ABI to fixup shell path according to ABI root directory.
Since the non-native ABI can now specify its root directory directly
to namei() via pwd_altroot() call this facility is not needed anymore.

Differential Revision: https://reviews.freebsd.org/D40092
MFC after: 2 month


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 5eeb4f73 17-Nov-2022 Doug Rabson <dfr@FreeBSD.org>

imgact_binmisc: Optionally pre-open the interpreter vnode

This allows the use of chroot and/or jail environments which depend on
interpreters registed with imgact_binmisc to use emulator binaries from
the host to emulate programs inside the chroot.

Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D37432


# 5b5b7e2c 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: always retain path buffer after lookup

This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36542


# 5e5675cb 12-Aug-2022 Konstantin Belousov <kib@FreeBSD.org>

Remove struct proc p_singlethr member

It does not serve any purpose after we stopped doing
thread_single(SINGLE_ALLPROC) from stoppable user processes.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36207


# 939f0b63 10-May-2022 Kornel Dulęba <kd@FreeBSD.org>

Implement shared page address randomization

It used to be mapped at the top of the UVA.
If the randomization is enabled any address above .data section will be
randomly chosen and a guard page will be inserted in the shared page
default location.
The shared page is now mapped in exec_map_stack, instead of
exec_new_vmspace. The latter function is called before image activator
has a chance to parse ASLR related flags.
The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page
address.
The feature is enabled by default for 64 bit applications on all
architectures.
It can be toggled kern.elf64.aslr.shared_page sysctl.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35349


# 361971fb 02-Jun-2022 Kornel Dulęba <kd@FreeBSD.org>

Rework how shared page related data is stored

Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35393


# 4493a13e 15-May-2022 Konstantin Belousov <kib@FreeBSD.org>

Do not single-thread itself when the process single-threaded some another process

Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), it cannot be mixed: thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.

In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# 773fa8cd 25-Jan-2022 Kyle Evans <kevans@FreeBSD.org>

execve: disallow argc == 0

The manpage has contained the following verbiage on the matter for just
under 31 years:

"At least one argument must be present in the array"

Previous to this version, it had been prefaced with the weakening phrase
"By convention."

Carry through and document it the rest of the way. Allowing argc == 0
has been a source of security issues in the past, and it's hard to
imagine a valid use-case for allowing it. Toss back EINVAL if we ended
up not copying in any args for *execve().

The manpage change can be considered "Obtained from: OpenBSD"

Reviewed by: emaste, kib, markj (all previous version)
Differential Revision: https://reviews.freebsd.org/D34045


# 1811c1e9 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Reimplement stack address randomization

The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack. This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings. In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap so small
(< several MB) limits could not be used.

There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose. Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.

The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.

PR: 260303
Reviewed by: kib
Discussed with: emaste, mw
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# 758d98de 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Remove the stack gap implementation

ASLR stack randomization will reappear in a forthcoming commit. Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.

No functional change intended, as the stack gap implementation is
currently disabled by default.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# 706f4a81 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Introduce the PROC_PS_STRINGS() macro

Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro. With stack address randomization the
ps_strings address is no longer fixed.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# 1544f5ad 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

Revert "kern_exec: Add kern.stacktop sysctl."

The current ASLR stack gap feature will be removed, and with that the
need for the kern.stacktop sysctl is gone. All consumers have been
removed.

This reverts commit a97d697122da2bfb0baae5f0939d118d119dae33.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# f04a0960 30-Dec-2021 Mark Johnston <markj@FreeBSD.org>

exec: Simplify sv_copyout_strings implementations a bit

Simplify control flow around handling of the execpath length and signal
trampoline. Cache the sysentvec pointer in a local variable.

No functional change intended.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33703


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# be10c0a9 03-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

fexecve(2): allow O_PATH file descriptors opened without O_EXEC

This improves compatibility with Linux.

Noted by: Drew DeVault <sir@cmpwn.com>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32821


# e4ce23b2 03-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

fexecve(2): restore the attempts to calculate the executable path

vn_fullpath() call was not converted to pass newtextvp, instead it used
imgp->vp which is still NULL there. As result vn_fullpath() always
returned EINVAL and execpath was recorded from the value of arg0.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 1c696903 20-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Unmap shared page manually before doing vm_map_remove() on exit or exec

This allows the pmap_remove(min, max) call to see empty pmap and exploit
empty pmap optimization.

Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569


# 351d5f7f 23-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

exec: store parent directory and hardlink name of the binary in struct proc

While doing it, also move all the code to resolve pathnames and obtain
text vp and dvp, into single place. Besides simplifying the code, it
avoids spurious vnode relocks and validates the explanation why
a transient text reference on the script vnode is not harmful.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 0c10648f 22-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

exec: provide right hardlink name in AT_EXECPATH

For this, use vn_fullpath_hardlink() to resolve executable name for
execve(2).

This should provide the right hardlink name, used for execution, instead
of random hardlink pointing to this binary. Also this should make the
AT_EXECNAME reliable for execve(2), since kernel only needs to resolve
parent directory path, which should always succeed (except pathological
cases like unlinking a directory).

PR: 248184
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 15bf81f3 23-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

struct image_params: use bool type for boolean members

Also re-align comments, and group booleans and char members together.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 9d58243f 23-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

do_execve(): switch boolean locals to use bool type

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 143dba3a 22-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

kern_exec.c: style

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 46dd801a 16-Oct-2021 Colin Percival <cperciva@FreeBSD.org>

Add userland boot profiling to TSLOG

On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.

Expose this information via a new sysctl, debug.tslog_user.

On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.

Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.

With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc. The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling

Reviewed by: mhorne
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D32493


# a97d6971 13-Oct-2021 Dawid Gorecki <dgr@semihalf.com>

kern_exec: Add kern.stacktop sysctl.

With stack gap enabled top of the stack is moved down by a random
amount of bytes. Because of that some multithreaded applications
which use kern.usrstack sysctl to calculate address of stacks for
their threads can fail. Add kern.stacktop sysctl, which can be used
to retrieve address of the stack after stack gap is applied to it.
Returns value identical to kern.usrstack for processes which have
no stack gap.

Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31897


# 889b56c8 13-Oct-2021 Dawid Gorecki <dgr@semihalf.com>

setrlimit: Take stack gap into account.

Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.

Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.

PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516


# a0558fe9 28-Apr-2021 Mateusz Guzik <mjg@FreeBSD.org>

Retire code added to support CloudABI

CloudABI was removed in cf0ee8738e31aa9e6fbf4dca4dac56d89226a71a


# b5cadc64 04-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Make core dump writes interruptible with SIGKILL

This can be disabled by sysctl kern.core_dump_can_intr

Reported and tested by: pho
Reviewed by: imp, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32313


# 63cb9308 01-Oct-2021 Justin Hibbits <jhibbits@FreeBSD.org>

Fix segment size in compressing core dumps

A core segment is bounded in size only by memory size. On 64-bit
architectures this means a segment can be much larger than 4GB.
However, compress_chunk() takes only a u_int, clamping segment size to
4GB-1, resulting in a truncated core. Everything else, including the
compressor internally, uses size_t, so use size_t at the boundary here.

This dates back to the original refactor back in 2015 (r279801 /
aa14e9b7).

MFC after: 1 week
Sponsored by: Juniper Networks, Inc.


# 397f1889 12-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

Remove SV_CAPSICUM

It was only needed for cloudabi

Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D31923


# b7924341 27-Aug-2021 Andrew Turner <andrew@FreeBSD.org>

Create sys/reg.h for the common code previously in machine/reg.h

Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).

Reviewed by: imp, markj
Sponsored by: DARPA, AFRL (original work)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19830


# af29f399 28-Jul-2021 Dmitry Chagin <dchagin@FreeBSD.org>

umtx: Split umtx.h on two counterparts.

To prevent umtx.h polluting by future changes split it on two headers:
umtx.h - ABI header for userspace;
umtxvar.h - the kernel staff.

While here fix umtx_key_match style.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31248
MFC after: 2 weeks


# 5fd9cd53 20-Jul-2021 Dmitry Chagin <dchagin@FreeBSD.org>

linux(4): Modify sv_onexec hook to return an error.

Temporary add stubs to the Linux emulation layer which calls the existing hook.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30911
MFC after: 2 weeks


# 62ba4cd3 20-Jul-2021 Dmitry Chagin <dchagin@FreeBSD.org>

Call sv_onexec hook after the process VA is created.

For future use in the Linux emulation layer call sv_onexec hook right after
the new process address space is created. It's safe, as sv_onexec used only
by Linux abi and linux_on_exec() does not depend on a state of process VA.

Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D30899
MFC after: 2 weeks


# 28a66fc3 01-Jul-2021 Konstantin Belousov <kib@FreeBSD.org>

Do not call FreeBSD-ABI specific code for all ABIs

Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI. Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.

Requested by: dchagin
Reviewed by: dchagin, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987


# 71ab3445 01-Jul-2021 Konstantin Belousov <kib@FreeBSD.org>

Add sv_onexec_old() sysent hook for exec event

Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure.
The old vmspace for the process is still intact during the call.

Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987


# db8d680e 01-Jul-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS

This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared. The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.

The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30939


# 615f22b2 29-Jun-2021 Dmitry Chagin <dchagin@FreeBSD.org>

Add a link to the Elf_Brandinfo into the struc proc.

To allow the ABI to make a dicision based on the Brandinfo add a link
to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits
of Elf_Brandinfo flags is private to the ABI.

Note to MFC: it breaks KBI.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30918
MFC after: 2 weeks


# 2d423f76 14-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

sysent: allow ABI to disable setid on exec.

Reviewed by: dchagin
Tested by: trasz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28154


# 19e6043a 14-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

kern_exec.c: Add execve_nosetid() helper

Reviewed by: dchagin
Tested by: trasz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28154


# 3b9971c8 25-May-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Clean up some of the core dumping code.

No functional changes.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30397


# 154f0ecc 22-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

Fix tinderbox build after 1762f674ccb571e6 ktrace commit.


# 1762f674 14-May-2021 Konstantin Belousov <kib@FreeBSD.org>

ktrace: pack all ktrace parameters into allocated structure ktr_io_params

Ref-count the ktr_io_params structure instead of vnode/cred.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30257


# 33621dfc 22-May-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Refactor core dumping code a bit

This makes it possible to use core_write(), core_output(),
and sbuf_drain_core_output(), in Linux coredump code. Moving
them out of imgact_elf.c is necessary because of the weird way
it's being built.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30369


# f1c3adef 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

execve: Mark exec argument buffers

We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns. Mark them invalid when they are freed to the cache.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29460


# 4ea65707 11-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

exec_new_vmspace: print useful error message on ctty if stack cannot be mapped.

After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.

In the relatively common case of failure to map stack, print some hints
on the control terminal. Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.

Requested by: jhb
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050


# 2e1c94aa 08-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

Implement enforcing write XOR execute mapping policy.

It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.

Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050


# 673e2dd6 18-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Add ELF flag to disable ASLR stack gap.

Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().

PR: 239873
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 87a9b18d 23-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Provide ABI modules hooks for process exec/exit and thread exit.

Exec and exit are same as corresponding eventhandler hooks.

Thread exit hook is called somewhat earlier, while thread is still
owned by the process and enough context is available. Note that the
process lock is owned when the hook is called.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27309


# e68c6191 21-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Stop using eventhandlers for itimers subsystem exec and exit hooks.

While there, do some minor cleanup for kclocks. They are only
registered from kern_time.c, make registration function static.
Remove event hooks, they are not used by both registered kclocks.
Add some consts.

Perhaps we can stop registering kclocks at all and statically
initialize them.

Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27305


# 74a093eb 21-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Stop using eventhandler to invoke umtx_exec hook.

There is no point in dynamic registration, umtx hook is there always.

Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27303


# 85078b85 17-Nov-2020 Conrad Meyer <cem@FreeBSD.org>

Split out cwd/root/jail, cmask state from filedesc table

No functional change intended.

Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).

__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.

Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037


# f7db0c95 04-Nov-2020 Mark Johnston <markj@FreeBSD.org>

vmspace: Convert to refcount(9)

This is mostly mechanical except for vmspace_exit(). There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace. In that case, upon switching to
vmspace0 we can unconditionally release the reference.

Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.

Reviewed by: alc, kib, mmel
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27057


# 275c821d 24-Oct-2020 Kyle Evans <kevans@FreeBSD.org>

audit: correct reporting of *execve(2) success

r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.

Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.

PR: 249179
Reported by: Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by: csjp, kib
MFC after: 1 week
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26922


# 5dca94ee 23-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Remove pointless local variable.

Reported by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 6 days


# aaf78c16 23-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not leak oldvmspace if image activation failed

and current address space is already destroyed, so kern_execve()
terminates the process.

While there, clean up some internals of post_execve() inlined in init_main.

Reported by: Peter <pmc@citylink.dinoex.sub.org>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26525


# feabaaf9 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the always curthread argument from reverse lookup routines

Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by: pho


# d8bc2a17 15-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

fd: remove fd_lastfile

It keeps recalculated way more often than it is needed.

Provide a routine (fdlastfile) to get it if necessary.

Consumers may be better off with a bitmap iterator instead.


# 698dda67 02-May-2020 Conrad Meyer <cem@FreeBSD.org>

kern_exec.c: Produce valid code ifndef SYS_PROTO_H

Reported by: Coccinelle


# b24e6ac8 16-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Convert canary, execpathp, and pagesizes to pointers.

Use AUXARGS_ENTRY_PTR to export these pointers. This is a followup to
r359987 and r359988.

Reviewed by: jhb
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24446


# 9df1c38b 15-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Export argc, argv, envc, envv, and ps_strings in auxargs.

This simplifies discovery of these values, potentially with reducing the
number of syscalls we need to make at runtime. Longer term, we wish to
convert the startup process to pass an auxargs pointer to _start() and
use that rather than walking off the end of envv. This is cleaner,
more C-friendly, and for systems with strong bounds (e.g. CHERI)
necessary.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24407


# 397df744 15-Apr-2020 Brooks Davis <brooks@FreeBSD.org>

Make ps_strings in struct image_params into a pointer.

This is a prepratory commit for D24407.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA


# 59838c1a 01-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Retire procfs-based process debugging.

Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.

PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837


# 8d4d271e 04-Mar-2020 Mateusz Guzik <mjg@FreeBSD.org>

execve: use LOCKSHARED when looking up the interpreter

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23956


# b05ca429 02-Mar-2020 Pawel Biernacki <kaktus@FreeBSD.org>

sys/: Document few more sysctls.

Submitted by: Antranig Vartanian <antranigv@freebsd.am>
Reviewed by: kaktus
Commented by: jhb
Approved by: kib (mentor)
Sponsored by: illuria security
Differential Revision: https://reviews.freebsd.org/D23759


# 7aaf252c 28-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Convert a few triviail consumers to the new unlocked grab API.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23847


# a113b17f 20-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not read sigfastblock word on syscall entry.

On machines with SMAP, fueword executes two serializing instructions
which can be seen in microbenchmarks.

As a measure to restore microbenchmark numbers, only read the word on
the attempt to deliver signal in ast(). If the word is set, signal is
not delivered and word is kept, preventing interruption of
interruptible sleeps by signals until userspace calls
sigfastblock(UNBLOCK) which clears the word.

This way, the spurious EINTR that userspace can see while in critical
section is on first interruptible sleep, if a signal is pending, and
on signal posting. It is believed that it is not important for rtld
and lbithr critical sections. It might be visible for the application
code e.g. for the callback of dl_iterate_phdr(3), but again the belief
is that the non-compliance is acceptable. Most important is that the
retry of the sleeping syscall does not interrupt unless additional
signal is posted.

For now I added the knob kern.sigfastblock_fetch_always to enable the
word read on syscall entry to be able to diagnose possible issues due
to spurious EINTR.

While there, do some code restructuting to have all sigfastblock()
handling located in kern_sig.c.

Reviewed by: jeff
Discussed with: mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D23622


# 146fc63f 09-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Add a way to manage thread signal mask using shared word, instead of syscall.

A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.

The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).

With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.

The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.

Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773


# 643656cf 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace VOP_MARKATIME with VOP_MMAPPED

The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23422


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# af009714 14-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Handle pagein clustering in vm_page_grab_valid() so that it can be used by
exec_map_first_page(). This will also enable pagein clustering for other
interested consumers (tmpfs, md, etc).

Discussed with: alc
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D22731


# d8010b11 09-Dec-2019 John Baldwin <jhb@FreeBSD.org>

Copy out aux args after the argument and environment vectors.

Partially revert r354741 and r354754 and go back to allocating a
fixed-size chunk of stack space for the auxiliary vector. Keep
sv_copyout_auxargs but change it to accept the address at the end of
the environment vector as an input stack address and no longer
allocate room on the stack. It is now called at the end of
copyout_strings after the argv and environment vectors have been
copied out.

This should fix a regression in r354754 that broke the stack alignment
for newer Linux amd64 binaries (and probably broke Linux arm64 as
well).

Reviewed by: kib
Tested on: amd64 (native, linux64 (only linux-base-c7), and i386)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22695


# 31174518 03-Dec-2019 John Baldwin <jhb@FreeBSD.org>

Use uintptr_t instead of register_t * for the stack base.

- Use ustringp for the location of the argv and environment strings
and allow destp to travel further down the stack for the stackgap
and auxv regions.
- Update the Linux copyout_strings variants to move destp down the
stack as was done for the native ABIs in r263349.
- Stop allocating a space for a stack gap in the Linux ABIs. This
used to hold translated system call arguments, but hasn't been used
since r159992.

Reviewed by: kib
Tested on: md64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22501


# 03b0d68c 18-Nov-2019 John Baldwin <jhb@FreeBSD.org>

Check for errors from copyout() and suword*() in sv_copyout_args/strings.

Reviewed by: brooks, kib
Tested on: amd64 (amd64, i386, linux64), i386 (i386, linux)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22401


# 01a2b567 17-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

kern_exec: p_osrel and p_fctl0 were obliterated by failed execve(2) attempt.

Zeroing of them is needed so that an image activator can update the
values as appropriate (or not set at all).

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22379


# e3532331 15-Nov-2019 John Baldwin <jhb@FreeBSD.org>

Add a sv_copyout_auxargs() hook in sysentvec.

Change the FreeBSD ELF ABIs to use this new hook to copyout ELF auxv
instead of doing it in the sv_fixup hook. In particular, this new
hook allows the stack space to be allocated at the same time the auxv
values are copied out to userland. This allows us to avoid wasting
space for unused auxv entries as well as not having to recalculate
where the auxv vector is by walking back up over the argv and
environment vectors.

Reviewed by: brooks, emaste
Tested on: amd64 (amd64 and i386 binaries), i386, mips, mips64
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D22355


# 0012f373 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(4/6) Protect page valid with the busy lock.

Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594


# 63e97555 14-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

(1/6) Replace busy checks with acquires where it is trival to do so.

This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548


# 55248d32 26-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Fix handling of invalid pages in exec_map_first_page().

exec_map_first_page() would unconditionally free an unbacked, invalid
page from the executable image. However, it is possible that the page
is wired, in which case it is incorrect to free the page, so check for
additional wirings first.

Reported by: syzkaller
Tested by: pho
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21767


# fee2a2fa 09-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Change synchonization rules for vm_page reference counting.

There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486


# 1040254b 07-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

In do_execve(), use shared text vnode lock consistently.

Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21560


# 1c36b728 07-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

In do_execve(), clear imgp->textset when restarting for interpreter.

Otherwise, we might left the boolean set, which would affect cleanup
after an error on interpreter activation.

Reviewed by: markj
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D21560


# fe69291f 03-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

Add procctl(PROC_STACKGAP_CTL)

It allows a process to request that stack gap was not applied to its
stacks, retroactively. Also it is possible to control the gaps in the
process after exec.

PR: 239894
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21352


# b5d239cb 28-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Wire pages in vm_page_grab() when appropriate.

uiomove_object_page() and exec_map_first_page() would previously wire a
page after having grabbed it. Ask vm_page_grab() to perform the wiring
instead: this removes some redundant code, and is cheaper in the case
where the requested page is not resident since the page allocator can be
asked to initialize the page as wired, whereas a separate vm_page_wire()
call requires the page lock.

In vm_imgact_hold_page(), use vm_page_unwire_noq() instead of
vm_page_unwire(PQ_NONE). The latter ensures that the page is dequeued
before returning, but this is unnecessary since vm_page_free() will
trigger a batched dequeue of the page.

Reviewed by: alc, kib
Tested by: pho (part of a larger patch)
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21440


# 01c3ba97 05-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix mis-merge

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# f422bc30 02-Aug-2019 John Baldwin <jhb@FreeBSD.org>

Set ISOPEN in namei flags when opening executable interpreters.

These vnodes are explicitly opened via VOP_OPEN via
exec_check_permissions identical to the main exectuable image.
Setting ISOPEN allows filesystems to perform suitable checks in
VOP_LOOKUP (e.g. close-to-open consistency in the NFS client).

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D21129


# fc83c5a7 31-Jul-2019 Konstantin Belousov <kib@FreeBSD.org>

Make randomized stack gap between strings and pointers to argv/envs.

This effectively makes the stack base on the csu _start entry
randomized.

The gap is enabled if ASLR is for the ABI is enabled, and then
kern.elf{64,32}.aslr.stack_gap specify the max percentage of the
initial stack size that can be wasted for gap. Setting it to zero
disables the gap, and max is capped at 50%.

Only amd64 for now.

Reviewed by: cem, markj
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21081


# eeacb3b0 08-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Merge the vm_page hold and wire mechanisms.

The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended. __FreeBSD_version is bumped.

Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247


# e2e050c8 19-May-2019 Conrad Meyer <cem@FreeBSD.org>

Extract eventfilter declarations to sys/_eventfilter.h

This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# 91ff2d48 12-Apr-2019 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unneeded conditionals for sv_ functions - all the ABIs
(apart from null_sysvec) define them, so the 'else' branch is
never taken.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19889


# 6f1fe330 16-Mar-2019 Konstantin Belousov <kib@FreeBSD.org>

amd64: Add md process flags and first P_MD_PTI flag.

PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.

On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.

MFC note: md_flags change struct proc KBI.

Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514


# fa50a355 10-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Implement Address Space Layout Randomization (ASLR)

With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603


# 6f26dd50 07-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

do_execve(): lock vnode when needed.

Code after exec_fail_dealloc label expects that the image vnode is
locked if present. When copyout() of the strings or auxv vectors fails,
goto to the error handling did not relocked the vnode as required.

The copyout() can be made failing e.g. by creating an ELF image with
PT_GNU_STACK segment disabling the write.

Reported by: Jonathan Stuart <n0t.jcs@gmail.com> (found by fuzzing)
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# cc426dd3 11-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove unused argument to priv_check_cred.

Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by: The FreeBSD Foundation


# f373437a 29-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Add helper functions to copy strings into struct image_args.

Given a zeroed struct image_args with an allocated buf member,
exec_args_add_fname() must be called to install a file name (or NULL).
Then zero or more calls to exec_args_add_env() followed by zero or
more calls to exec_args_add_env(). exec_args_adjust_args() may be
called after args and/or env to allow an interpreter to be prepended to
the argument list.

To allow code reuse when adding arg and env variables, begin_envv
should be accessed with the accessor exec_args_get_begin_envv()
which handles the case when no environment entries have been added.

Use these functions to simplify exec_copyin_args() and
freebsd32_exec_copyin_args().

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15468


# f5cf7589 23-Nov-2018 Konstantin Belousov <kib@FreeBSD.org>

Provide storage for the process feature control flags in struct proc.

The flags are cleared on exec, it is up to the image activator to set
them.

Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 12e69f96 02-Nov-2018 Brooks Davis <brooks@FreeBSD.org>

Add const to input-only char * arguments.

These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declerations of system calls.

Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17812


# 2fe6aeff 17-Aug-2018 Mariusz Zaborski <oshogbo@FreeBSD.org>

capsicum: allow the setproctitle(3) function in capability mode

Capsicum in past allowed to change the process title.
This was broken with r335939.

PR: 230584
Submitted by: Yuichiro NAITO <naito.yuichiro@gmail.com>
Reported by: ian@niw.com.au
MFC after: 1 week


# d92da759 13-Jul-2018 Brooks Davis <brooks@FreeBSD.org>

Round down the location of execpathp to slightly improve copyout speed.

In practice, this moves the padding from below the canary to above
execpathp has no impact on stack consumption.

Submitted by: Wuyang-Chung (via github pull request #159)
MFC after: 1 week


# 2bf95012 05-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Create a new macro for static DPCPU data.

On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140


# 3e7cb27c 04-Jun-2018 Alan Cox <alc@FreeBSD.org>

Use a single, consistent approach to returning success versus failure in
vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases. Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic. vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value. That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.

Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.

Eliminate a redundant error check from kern_madvise(). Add a comment
explaining where the check is performed.

Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value. Since MADV_FREE is passed as the
behavior, the return value will always be zero.

Reviewed by: kib, markj
MFC after: 7 days


# b8d908b7 01-Jun-2018 Ed Maste <emaste@FreeBSD.org>

ANSIfy sys/kern


# 5f77b8a8 24-May-2018 Brooks Davis <brooks@FreeBSD.org>

Avoid two suword() calls per auxarg entry.

Instead, construct an auxargs array and copy it out all at once.

Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happend
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.

Return errors on copyout() and suword() failures and handle them in the
caller.

Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).

Reviewed by: kib
Comments from: emaste, jhb
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15485


# cbd92ce6 09-May-2018 Matt Macy <mmacy@FreeBSD.org>

Eliminate the overhead of gratuitous repeated reinitialization of cap_rights

- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.

Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month


# 1302eea7 20-Apr-2018 Konstantin Belousov <kib@FreeBSD.org>

Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_GET
-> PROC_PDEATHSIG_STATUS for consistency with other procctl(2)
operations names.

Requested by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 13 days


# 73c8686e 19-Apr-2018 John Baldwin <jhb@FreeBSD.org>

Simplify the code to allocate stack for auxv, argv[], and environment vectors.

Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().

Reviewed by: emaste
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15123


# b9408863 18-Apr-2018 Konstantin Belousov <kib@FreeBSD.org>

Add PROC_PDEATHSIG_SET to procctl interface.

Allow processes to request the delivery of a signal upon death of
their parent process. Supposed consumer of the feature is PostgreSQL.

Submitted by: Thomas Munro
Reviewed by: jilles, mjg
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D15106


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# b2387aa6 07-Feb-2018 Andriy Gapon <avg@FreeBSD.org>

exec_map_first_page: fix an inverse condition introduced in r254138

While the bug itself was serious, as we could either pass a non-busied
page to vm_pager_get_pages() or leak a busy page, it could only be
triggered under a very rare condition where the page is already inserted
into the object, but it is not valid yet.

Reviewed by: kib
MFC after: 2 weeks


# d821d364 21-Jan-2018 Pedro F. Giffuni <pfg@FreeBSD.org>

Unsign some values related to allocation.

When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after: 3 weeks


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# 814629dd 24-Nov-2017 Ed Schouten <ed@FreeBSD.org>

Don't let cpu_set_syscall_retval() clobber exec_setregs().

Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).

Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.

Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.

To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.

Reviewed by: kib, jhibbits
Differential Revision: https://reviews.freebsd.org/D13180


# 2ca45184 09-Nov-2017 Matt Joras <mjoras@FreeBSD.org>

Introduce EVENTHANDLER_LIST and some users.

This introduces a facility to EVENTHANDLER(9) for explicitly defining a
reference to an event handler list. This is useful since previously all
invokers of events had to do a locked traversal of the global list of
event handler lists in order to find the appropriate event handler list.
By keeping a pointer to the appropriate list an invoker can avoid this
traversal completely. The pointer is initialized with SYSINIT(9) during
the eventhandler stage. Users registering interest in events do not need
to know if the event is backed by such a list, since the list is added
to the global list of lists. As with lists that are not pre-defined it
is safe to register for the events before the list has been created.

This converts the process_* and thread_* events to using the new
facility, as these are events whose locked traversals end up showing up
significantly in ports build workflows (and presumably other workflows
with many short lived threads/procs). It may be advantageous to convert
other events to using the new facility.

The el_flags field is now unused, but leave it be so that this revision
can be MFC'd.

Reviewed by: bdrewery, markj, mjg
Approved by: rstone (mentor)
In collaboration with: ian
MFC after: 4 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12814


# e6b645ae 18-Oct-2017 Mateusz Guzik <mjg@FreeBSD.org>

execve: avoid one proc lock/unlock trip unless PTRACE_EXEC is set

MFC after: 1 week


# 80a2397a 18-Oct-2017 Mateusz Guzik <mjg@FreeBSD.org>

Tidy up pmc support at execve.

The proc-specific check is inherently racy, so the code can just unlock
beforehand.

MFC after: 1 week


# 467b3975 03-Jul-2017 Konstantin Belousov <kib@FreeBSD.org>

Resolve confusion between different error code spaces.

The vm_map_fixed() and vm_map_stack() VM functions return Mach error
codes. Convert them into errno values before returning result from
exec_new_vmspace().

While there, modernize the comment and do minor style adjustments.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 8056df6e 30-Jun-2017 Alan Cox <alc@FreeBSD.org>

Clear the MAP_WIREFUTURE flag on the vm map in exec_new_vmspace() when it
recycles the current vm space. Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).

It's pointless for vm_map_stack() to check the MEMLOCK limit. It will
never be asked to wire the stack. Moreover, it doesn't even implement
wiring of the stack.

Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11421


# 3e85b721 16-May-2017 Ed Maste <emaste@FreeBSD.org>

Remove register keyword from sys/ and ANSIfy prototypes

A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193


# c6a4ba5a 14-Feb-2017 Mark Johnston <markj@FreeBSD.org>

Apply MADV_FREE to exec_map entries only after a lowmem event.

This effectively provides the same benefit as applying MADV_FREE inline
upon every execve, since the page daemon invokes lowmem handlers prior to
scanning the inactive queue. It also has less overhead; the cost of
applying MADV_FREE is very noticeable on many-CPU systems since it includes
that of a TLB shootdown of global PTEs. For instance, this change nearly
halves the system CPU usage during a buildkernel on a 128-vCPU EC2
instance (with some other patches applied).

Benchmarked by: cperciva (earlier version)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9586


# 6e89d383 06-Jan-2017 Konstantin Belousov <kib@FreeBSD.org>

Explicitely add "opt_compat.h" to kern_exec.c: fix powerpc LINT builds.

sys/ptrace.h includes sys/signal.h, which includes sys/_sigset.h.
Note that sys/_sigset.h only defines osigset_t if COMPAT_43 was defined.

Two lines later, sys/ptrace.h includes machine/reg.h, which in case of
powerpc, includes opt_compat.h.

After the include headers reordering in r311345, we have sys/ptrace.h
included before sys/sysproto.h.

If COMPAT_43 was requested in the kernel config, the result is that
sys/_sigset.h does not define osigset_t, but sys/sysproto.h sees
COMPAT_43 and uses osigset_t.

Fix this by explicitely including opt_compat.h to cover the whole
kern/kern_exec.c scope.

Sponsored by: The FreeBSD Foundation


# ec492b13 04-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Add a small allocator for exec_map entries.

Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.

The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8921


# eeeaa7ba 04-Jan-2017 Mark Johnston <markj@FreeBSD.org>

Sort includes in kern_exec.c.

MFC after: 1 week


# 7667839a 15-Nov-2016 Alan Cox <alc@FreeBSD.org>

Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)

Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497


# 53dc58f2 19-Oct-2016 Mateusz Guzik <mjg@FreeBSD.org>

Mark a bunch of mpsafe sysctls as such.

This gives me a sysctl Giant-free buildworld.


# ce3ee09b 14-Aug-2016 Alan Cox <alc@FreeBSD.org>

Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when
neither vm_pager_has_page() nor vm_pager_get_pages() is called.

Reviewed by: kib, markj
MFC after: 3 weeks


# fc9bbf27 13-Aug-2016 Alan Cox <alc@FreeBSD.org>

Eliminate two calls to vm_page_xunbusy() that are both unnecessary and
incorrect from the error cases in exec_map_first_page(). They are
unnecessary because we automatically unbusy the page in vm_page_free()
when we remove it from the object. The calls are incorrect because they
happen after the page is freed, so we might actually unbusy the page
after it has been reallocated to a different object. (This error was
introduced in r292373.)

Reviewed by: kib
MFC after: 1 week


# 77d68094 18-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

The assertion re-added in r302614 was triggered when stopping signal
is delivered to vforked child. Issue is that we avoid stopping such
children in issignal() to not block parents. But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.

On first exec after vfork(), call signotify() to handle pending
reenabled signals. Adjust the assert to not check vfork children
until exec.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 8d570f64 15-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Add a mask of optional ptrace() events.

ptrace() now stores a mask of optional events in p_ptevents. Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX. PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7044


# 9e590ff0 27-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function. The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty. As result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications. The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach(). Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken. Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by: Eric Badger <eric@badgerio.us>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6859


# 3fc292d5 07-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Old process credentials for setuid execve must not be dereferenced
when the process credentials were not changed. This can happen if an
error occured trying to activate the setuid binary. And on error, if
new credentials were not yet assigned, they must be freed to not
create the leak.

Use oldcred == NULL as the predicate to detect credential
reassignment.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation


# cda68844 27-May-2016 Mateusz Guzik <mjg@FreeBSD.org>

exec: get rid of one vnode lock/unlock pair in do_execve

The lock was temporarily dropped for vrele calls, but they can be
postponed to a point where the lock is not held in the first place.

While here shuffle other code not needing the lock.


# 1afd78b3 26-May-2016 Bryan Drewery <bdrewery@FreeBSD.org>

exec: Provide execpath in imgp for the process_exec hook.

This was previously set after the hook and only if auxargs were present.
Now always provide it if possible.

MFC after: 2 weeks
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D6546


# 881010f0 26-May-2016 Bryan Drewery <bdrewery@FreeBSD.org>

exec: Add credential change information into imgp for process_exec hook.

This allows an EVENTHANDLER(process_exec) hook to see if the new image
will cause credentials to change whether due to setgid/setuid or because
of POSIX saved-id semantics.

This adds 3 new fields into image_params:
struct ucred *newcred Non-null if the credentials will change.
bool credential_setid True if the new image is setuid or setgid.

This will pre-determine the new credentials before invoking the image
activators, where the process_exec hook is called. The new credentials
will be installed into the process in the same place as before, after
image activators are done handling the image.

MFC after: 2 weeks
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D6544


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# 35030a5d 29-Mar-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove some NULL checks for M_WAITOK allocations.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 3b77ac5f 01-Mar-2016 Bryan Drewery <bdrewery@FreeBSD.org>

Correct a comment.


# 36160958 16-Dec-2015 Mark Johnston <markj@FreeBSD.org>

Fix style issues around existing SDT probes.

- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.

MFC after: 1 week


# b0cd2017 16-Dec-2015 Gleb Smirnoff <glebius@FreeBSD.org>

A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().

o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.

Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.

Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.

This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.

Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix


# e6b95927 06-Oct-2015 Conrad Meyer <cem@FreeBSD.org>

Fix core corruption caused by race in note_procstat_vmmap

This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath. As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption
and truncation, even if names change, at the cost of up to PATH_MAX
bytes per mapped object. The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass. This
addresses corruption, at the cost of sometimes producing a truncated
result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
to grok the new zero padding.

Reported by: pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3824


# 2f2f522b 27-Sep-2015 Andriy Gapon <avg@FreeBSD.org>

save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE

SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after: 20 days


# bcb60d52 07-Sep-2015 Conrad Meyer <cem@FreeBSD.org>

Follow-up to r287442: Move sysctl to compiled-once file

Avoid duplicate sysctl nodes.

Found by: tijl
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3586


# 39f5ebb7 03-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Add sysent flag to switch to capabilities mode on startup.

CloudABI processes should run in capabilities mode automatically. There
is no need to switch manually (e.g., by calling cap_enter()). Add a
flag, SV_CAPSICUM, that can be used to call into cap_enter() during
execve().

Reviewed by: kib


# b4490c6e 18-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.

Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 457f7e23 16-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Implement CloudABI's exec() call.

Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute an new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.

Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?

CloudABI solves this problem by replacing traditional string command
line arguments by tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).

Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:

int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
size_t fdslen);

This change introduces a set of fd*_remapped() functions:

- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
fdinstall_remapped().

We then add a function exec_copyin_data_fds() that builds on top these
functions. It copies in the data and constructs a new remapped file
descriptor. This is used by cloudabi_sys_proc_exec().

Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.

Reviewers: kib, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3079


# 61617058 13-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

exec: textvp -> oldtextvp; binvp -> newtextvp

This makes it consistent with the rest of the naming in do_execve.

No functional changes.


# 853be5ff 13-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

exec plug a redundant vref + vrele of the image vnode


# 6ef12002 30-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

Do not calculate the stack's bottom address twice.

Submitted by: Olivц╘r Pintц╘r
Review: https://reviews.freebsd.org/D2953
MFC after: 1 week


# 093c7f39 12-Jun-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 7b445033 10-May-2015 Konstantin Belousov <kib@FreeBSD.org>

On exec, single-threading must be enforced before arguments space is
allocated from exec_map. If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptible waiting for the map space. Then, the thread which won
the race for the space allocation, cannot single-thread the process,
causing deadlock.

Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# fe8a824c 23-Apr-2015 Konstantin Belousov <kib@FreeBSD.org>

Handle incorrect ELF images specifying size for PT_GNU_STACK not being
multiple of page size.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 316b3843 15-Apr-2015 Konstantin Belousov <kib@FreeBSD.org>

Implement support for binary to requesting specific stack size for the
initial thread. It is read by the ELF image activator as the virtual
size of the PT_GNU_STACK program header entry, and can be specified by
the linker option -z stack-size in newer binutils.

The soft RLIMIT_STACK is auto-increased if possible, to satisfy the
binary' request.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3d653db0 21-Mar-2015 Alan Cox <alc@FreeBSD.org>

Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected. Previously,
the color setting was delayed until after the virtual address was
selected. In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages. Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory. (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision: https://reviews.freebsd.org/D2013
Reviewed by: jhb, kib
Tested by: Svatopluk Kraus (armv6)
MFC after: 6 weeks


# daf63fd2 15-Mar-2015 Mateusz Guzik <mjg@FreeBSD.org>

cred: add proc_set_cred helper

The goal here is to provide one place altering process credentials.

This eases debugging and opens up posibilities to do additional work when such
an action is performed.


# 677258f7 18-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process. Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with: jilles, rwatson
Man page fixes by: rwatson
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 6ddcc233 13-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume. It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 8a5177cc 31-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

filedesc: fix missed comments about fdsetugidsafety

While here just note that both fdsetugidsafety and fdcheckstd take sleepable
locks.


# 0a2c94b8 28-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

Replace some calls to fuword() by fueword() with proper error checking.

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 3 weeks


# 11888da8 21-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

filedesc: cleanup setugidsafety a little

Rename it to fdsetugidsafety for consistency with other functions.

There is no need to take filedesc lock if not closing any files.

The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.

While here tidy up is_unsafe.


# 5c37b305 20-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Plug unnecessary binvp NULL initialization and test.

Reported by: Coverity
CID: 1018889


# 8e572983 29-Sep-2014 Mateusz Guzik <mjg@FreeBSD.org>

Use bzero instead of explicitly zeroing stuff in do_execve.

While strictly speaking this is not correct since some fields are pointers,
it makes no difference on all supported archs and we already rely on it doing
the right thing in other places.

No functional changes.


# 70978c93 12-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

If vm_page_grab() allocates a new page, the page is not inserted into
page queue even when the allocation is not wired. It is
responsibility of the vm_page_grab() caller to ensure that the page
does not end on the vm_object queue but not on the pagedaemon queue,
which would effectively create unpageable unwired page.

In exec_map_first_page() and vm_imgact_hold_page(), activate the page
immediately after unbusying it, to avoid leak.

In the uiomove_object_page(), deactivate page before the object is
unlocked. There is no leak, since the page is deactivated after
uiomove_fromphys() finished. But allowing non-queued non-wired page
in the unlocked object queue makes it impossible to assert that leak
does not happen in other places.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 965d0860 14-Jul-2014 Mateusz Guzik <mjg@FreeBSD.org>

Plug p_pptr null test in do_execve. It is always true.


# 5e2554b7 07-Jul-2014 Mateusz Guzik <mjg@FreeBSD.org>

Don't call crdup nor uifind under vnode lock.

A locked vnode can get into the way of satisyfing malloc with M_WATOK.

This is a fixup to r268087.

Suggested by: kib
MFC after: 1 week


# e7d939bd 06-Jul-2014 Marcel Moolenaar <marcel@FreeBSD.org>

Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan


# a6bad85e 01-Jul-2014 Mateusz Guzik <mjg@FreeBSD.org>

Plug gcc warning after r268074 about unitialized newsigacts

Reported by: Gary Jennejohn <gljennjohn gmail.com>


# 350d5181 01-Jul-2014 Mateusz Guzik <mjg@FreeBSD.org>

Don't call crcopysafe or uifind unnecessarily in execve.

MFC after: 1 week


# d00c8ea4 01-Jul-2014 Mateusz Guzik <mjg@FreeBSD.org>

Perform a lockless check in sigacts_shared.

It is used only during execve (i.e. singlethreaded), so there is no fear
of returning 'not shared' which soon becomes 'shared'.

While here reorganize the code a little to avoid proc lock/unlock in
shared case.

MFC after: 1 week


# b0bc0cad 27-Jun-2014 Mateusz Guzik <mjg@FreeBSD.org>

Call fdcloseexec right after fdunshare.

No functional changes.

MFC after: 1 week


# b9d32c36 27-Jun-2014 Mateusz Guzik <mjg@FreeBSD.org>

Make fdunshare accept only td parameter.

Proc had to match the thread anyway and 2 parameters were inconsistent
with the rest.

MFC after: 1 week


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# e16406c7 26-Jun-2014 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove duplicated includes.

Submitted by: Mariusz Zaborski <oshogbo@FreeBSD.org>


# 78960940 08-Jun-2014 Alan Cox <alc@FreeBSD.org>

Refresh a comment. The VM_STACK option was eliminated in r43209.

Sponsored by: EMC / Isilon Storage Division


# 7032434e 20-May-2014 Konstantin Belousov <kib@FreeBSD.org>

When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace. The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process. The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec. Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.

To fix, postpone the free of old vmspace until the threads are resumed
and exited. To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by: Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 88b124ce 18-Mar-2014 Konstantin Belousov <kib@FreeBSD.org>

Make the array pointed to by AT_PAGESIZES auxv properly aligned.

Also, remove the expression which calculated the location of the
strings for a new image and grown over the time to be
non-comprehensible. Instead, calculate the offsets by steps, which
also makes fixing the alignments much cleaner.

Reported and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# eda6009c 15-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

Add a sysctl kern.disallow_high_osrel which disables executing the
images compiled on the world with higher major version number than the
high version number of the booted kernel. Default to disable.

Sponsored by: The FreeBSD Foundation
Discussed with: bapt
MFC after: 1 week


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 5944de8e 22-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 7b77e1fe 14-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after: 2 weeks


# c7aebda8 09-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 5df87b21 07-Aug-2013 Jeff Roberson <jeff@FreeBSD.org>

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# be996836 05-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by: EMC / Isilon storage division
Reported by: kib


# 3b6714ca 04-Aug-2013 Attilio Rao <attilio@FreeBSD.org>

The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho


# 27a18d6a 06-Jun-2013 Alan Cox <alc@FreeBSD.org>

Don't busy the page unless we are likely to release the object lock.

Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division


# 1e65d73c 02-Jun-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not map the shared page COW. If the process wired its address
space, fork(2) would cause shadowing of the physical object and
copying of the shared page into private copy, effectively preventing
updates for the exported timehands structure and stopping the clock.

Specify the maximum allowed permissions for the page to be read and
execute, preventing write from the user mode.

Reported and tested by: <huanghwh@yahoo.com>
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# 888d4d4f 07-Feb-2013 Konstantin Belousov <kib@FreeBSD.org>

When vforked child is traced, the debugging events are not generated
until child performs exec(). The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.

On the other hand, when debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.

Fix the issue by declining generating debug signals only when vfork()
was done and child called ptrace(PT_TRACEME). Set a new process flag
P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is
set, which indicates that the process was created with vfork() and
still did not execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case. The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.

Found and tested by: zont
Reviewed by: pluknet
MFC after: 2 weeks


# 140dedb8 02-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by: pho
MFC after: 3 weeks


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# c331c970 06-Oct-2012 Andriy Gapon <avg@FreeBSD.org>

ktrace/kern_exec: check p_tracecred instead of p_cred

.. when deciding whether to continue tracing across suid/sgid exec.
Otherwise if root ktrace-d an unprivileged process and the processed
exec-ed a suid program, then tracing didn't continue across exec.

Reviewed by: bde, kib
MFC after: 22 days


# 877d24ac 28-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


# c8e781f6 27-Sep-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Revert r240931, as the previous comment was actually in sync with POSIX.

I have to note that POSIX is simply stupid in how it describes O_EXEC/fexecve
and friends. Yes, not only inconsistent, but stupid.

In the open(2) description, O_RDONLY flag is described as:

O_RDONLY Open for reading only.

Taken from:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html

Note "for reading only". Not "for reading or executing"!

In the fexecve(2) description you can find:

The fexecve() function shall fail if:

[EBADF]
The fd argument is not a valid file descriptor open for executing.

Taken from:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

As you can see the function shall fail if the file was not open with O_EXEC!

And yet, if you look closer you can find this mess in the exec.html:

Since execute permission is checked by fexecve(), the file description
fd need not have been opened with the O_EXEC flag.

Yes, O_EXEC flag doesn't have to be specified after all. You can open a file
with O_RDONLY and you still be able to fexecve(2) it.


# 3a038c4d 25-Sep-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

We cannot open file for reading and executing (O_RDONLY | O_EXEC).
Well, in theory we can pass those two flags, because O_RDONLY is 0,
but we won't be able to read from a descriptor opened with O_EXEC.

Update the comment.

Sponsored by: FreeBSD Foundation
MFC after: 2 weeks


# 28a7f607 07-Jul-2012 Mateusz Guzik <mjg@FreeBSD.org>

Unbreak handling of descriptors opened with O_EXEC by fexecve(2).

While here return EBADF for descriptors opened for writing (previously it was ETXTBSY).

Add fgetvp_exec function which performs appropriate checks.

PR: kern/169651
In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 1 week


# a665ed98 23-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Move the code dealing with shared page into a dedicated
kern_sharedpage.c source file from kern_exec.c.

MFC after: 29 days


# 21c295ef 23-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Stop updating the struct vdso_timehands from even handler executed in
the scheduled task from tc_windup(). Do it directly from tc_windup in
interrupt context [1].

Establish the permanent mapping of the shared page into the kernel
address space, avoiding the potential need to sleep waiting for
allocation of sf buffer during vdso_timehands update. As a
consequence, shared_page_write_start() and shared_page_write_end()
functions are not needed anymore.

Guess and memorize the pointers to native host and compat32 sysentvec
during initialization, to avoid the need to get shared_page_alloc_sx
lock during the update.

In tc_fill_vdso_timehands(), do not loop waiting for timehands
generation to stabilize, since vdso_timehands is written in the same
interrupt context which wrote timehands.

Requested by: mav [1]
MFC after: 29 days


# aea81038 22-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Implement mechanism to export some kernel timekeeping data to
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.

The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.

The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.

Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.

Minimal stubs neccessary for non-x86 architectures to still compile
are provided.

Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month


# a9d8437c 22-Jun-2012 Konstantin Belousov <kib@FreeBSD.org>

Enchance the shared page chunk allocator.

Do not rely on the busy state of the page from which we allocate the
chunk, to protect allocator state. Use statically allocated sx lock
instead.

Provide more flexible KPI. In particular, allow to allocate chunk
without providing initial data, and allow writes into existing
allocation. Allow to get an sf buf which temporary maps the chunk, to
allow sequential updates to shared page content without unmapping in
between.

Reviewed by: jhb
Tested by: flo
MFC after: 1 month


# 44ad5475 08-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after: 2 weeks


# 2974cc36 19-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

Use shared lock for the executable vnode in the exec path after the
VV_TEXT changes are handled. Assert that vnode is exclusively locked at
the places that modify VV_TEXT.

Discussed with: alc
MFC after: 3 weeks


# ce8bd78b 27-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Do not deliver SIGTRAP on exec as the normal signal, use ptracestop() on
syscall exit path. Otherwise, if SIGTRAP is ignored, that tdsendsignal()
do not want to deliver the signal, and debugger never get a notification
of exec.

Found and tested by: Anton Yuzhaninov <citrin citrin ru>
Discussed with: jhb
MFC after: 2 weeks


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# a9d2f8d8 10-Aug-2011 Robert Watson <rwatson@FreeBSD.org>

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# ff66f6a4 17-Jul-2011 Robert Watson <rwatson@FreeBSD.org>

Define two new sysctl node flags: CTLFLAG_CAPRD and CTLFLAG_CAPRW, which
may be jointly referenced via the mask CTLFLAG_CAPRW. Sysctls with these
flags are available in Capsicum's capability mode; other sysctl nodes are
not.

Flag several useful sysctls as available in capability mode, such as memory
layout sysctls required by the run-time linker and malloc(3). Also expose
access to randomness and available kernel features.

A few sysctls are enabled to support name->MIB conversion; these may leak
information to capability mode by virtue of providing resolution on names
not flagged for access in capability mode. This is, generally, not a huge
problem, but might be something to resolve in the future. Flag these cases
with XXX comments.

Submitted by: jonathan
Sponsored by: Google, Inc.


# 12bc222e 30-Jun-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Add some checks to ensure that Capsicum is behaving correctly, and add some
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.

Approved by: mentor(rwatson), re(bz)


# 7705d4b2 25-Feb-2011 Dmitry Chagin <dchagin@FreeBSD.org>

Introduce preliminary support of the show description of the ABI of
traced process by adding two new events which records value of process
sv_flags to the trace file at process creation/execing/exiting time.

MFC after: 1 Month.


# 6297a3d8 08-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Create shared (readonly) page. Each ABI may specify the use of page by
setting SV_SHP flag and providing pointer to the vm object and mapping
address. Provide simple allocator to carve space in the page, tailored
to put the code with alignment restrictions.

Enable shared page use for amd64, both native and 32bit FreeBSD
binaries. Page is private mapped at the top of the user address
space, moving a start of the stack one page down. Move signal
trampoline code from the top of the stack to the shared page.

Reviewed by: alc


# d680caab 21-Oct-2010 John Baldwin <jhb@FreeBSD.org>

- When disabling ktracing on a process, free any pending requests that
may be left. This fixes a memory leak that can occur when tracing is
disabled on a process via disabling tracing of a specific file (or if
an I/O error occurs with the tracefile) if the process's next system
call is exit(). The trace disabling code clears p_traceflag, so exit1()
doesn't do any KTRACE-related cleanup leading to the leak. I chose to
make the free'ing of pending records synchronous rather than patching
exit1().
- Move KTRACE-specific logic out of kern_(exec|exit|fork).c and into
kern_ktrace.c instead. Make ktrace_mtx private to kern_ktrace.c as a
result.

MFC after: 1 month


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# de478dd4 30-Aug-2010 Jaakko Heinonen <jh@FreeBSD.org>

execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.

PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)


# 79856499 22-Aug-2010 Rui Paulo <rpaulo@FreeBSD.org>

Add an extra comment to the SDT probes definition. This allows us to get
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by: The FreeBSD Foundation
Discussed with: rwaston [1]


# ee235bef 17-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by: marius (sparc64)
MFC after: 1 month


# a14a9498 27-Jul-2010 Alan Cox <alc@FreeBSD.org>

The interpreter name should no longer be treated as a buffer that can be
overwritten. (This change should have been included in r210545.)

Submitted by: kib


# 2af6e14d 27-Jul-2010 Alan Cox <alc@FreeBSD.org>

Introduce exec_alloc_args(). The objective being to encapsulate the
details of the string buffer allocation in one place.

Eliminate the portion of the string buffer that was dedicated to storing
the interpreter name. The pointer to the interpreter name can simply be
made to point to the appropriate argument string.

Reviewed by: kib


# 9e4e5114 25-Jul-2010 Alan Cox <alc@FreeBSD.org>

Change the order in which the file name, arguments, environment, and
shell command are stored in exec*()'s demand-paged string buffer. For
a "buildworld" on an 8GB amd64 multiprocessor, the new order reduces
the number of global TLB shootdowns by 31%. It also eliminates about
330k page faults on the kernel address space.

Change exec_shell_imgact() to use "args->begin_argv" consistently as
the start of the argument and environment strings. Previously, it
would sometimes use "args->buf", which is the start of the overall
buffer, but no longer the start of the argument and environment
strings. While I'm here, eliminate unnecessary passing of "&length"
to copystr(), where we don't actually care about the length of the
copied string.

Clean up the initialization of the exec map. In particular, use the
correct size for an entry, and express that size in the same way that
is used when an entry is allocated. The old size was one page too
large. (This discrepancy originated in 2004 when I rewrote
exec_map_first_page() to use sf_buf_alloc() instead of the exec map
for mapping the first page of the executable.)

Reviewed by: kib


# 69a8f9e3 23-Jul-2010 Alan Cox <alc@FreeBSD.org>

Eliminate a little bit of duplicated code.


# e113db82 09-Jul-2010 John Baldwin <jhb@FreeBSD.org>

Accidentally committed an older version of this comment rather than the
final one.


# 07b18338 09-Jul-2010 John Baldwin <jhb@FreeBSD.org>

Refine a comment.

Reviewed by: bde


# 41890423 02-Jul-2010 Alan Cox <alc@FreeBSD.org>

Use vm_page_next() instead of vm_page_lookup() in exec_map_first_page()
because vm_page_next() is faster.


# ad6eec7b 29-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Tweak the in-kernel API for sending signals to threads:
- Rename tdsignal() to tdsendsignal() and make it private to kern_sig.c.
- Add tdsignal() and tdksignal() routines that mirror psignal() and
pksignal() except that they accept a thread as an argument instead of
a process. They send a signal to a specific thread rather than to an
individual process.

Reviewed by: kib


# afe1a688 23-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Reorganize syscall entry and leave handling.

Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
usermode into struct syscall_args. The structure is machine-depended
(this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
from the syscall. It is a generalization of
cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
return value.
sv_syscallnames - the table of syscall names.

Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().

The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.

Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().

Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively. The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.

The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.

Reviewed by: jhb, marcel, marius, nwhitehorn, stas
Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
stas (mips)
MFC after: 1 month


# eb00b276 06-May-2010 Alan Cox <alc@FreeBSD.org>

Eliminate page queues locking around most calls to vm_page_free().


# 5ac59343 05-May-2010 Alan Cox <alc@FreeBSD.org>

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


# 2965a453 29-Apr-2010 Kip Macy <kmacy@FreeBSD.org>

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# a0ea661f 25-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Add the ELF relocation base to struct image_params. This will be
required to correctly relocate the executable entry point's function
descriptor on powerpc64.


# a107d8aa 25-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

Change the arguments of exec_setregs() so that it receives a pointer
to the image_params struct instead of several members of that struct
individually. This makes it easier to expand its arguments in the future
without touching all platforms.

Reviewed by: jhb


# f4e26ade 23-Mar-2010 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

The nargvstr and nenvstr properties of arginfo are ints, not longs,
so should be copied to userspace with suword32() instead of suword().
This alleviates problems on 64-bit big-endian architectures, and is a
no-op on all 32-bit architectures.

Tested on: amd64, sparc64, powerpc64


# 49cc1344 21-Jan-2010 John Baldwin <jhb@FreeBSD.org>

MFC 198411:
- Fix several off-by-one errors when using MAXCOMLEN. The p_comm[] and
td_name[] arrays are actually MAXCOMLEN + 1 in size and a few places that
created shadow copies of these arrays were just using MAXCOMLEN.
- Prefer using sizeof() of an array type to explicit constants for the
array length in a few places.
- Ensure that all of p_comm[] and td_name[] is always zero'd during
execve() to guard against any possible information leaks. Previously
trailing garbage in p_comm[] could be leaked to userland in ktrace
record headers via td_name[].


# 5ca4819d 23-Oct-2009 John Baldwin <jhb@FreeBSD.org>

- Fix several off-by-one errors when using MAXCOMLEN. The p_comm[] and
td_name[] arrays are actually MAXCOMLEN + 1 in size and a few places that
created shadow copies of these arrays were just using MAXCOMLEN.
- Prefer using sizeof() of an array type to explicit constants for the
array length in a few places.
- Ensure that all of p_comm[] and td_name[] is always zero'd during
execve() to guard against any possible information leaks. Previously
trailing garbage in p_comm[] could be leaked to userland in ktrace
record headers via td_name[].

Reviewed by: bde


# bc7f0010 02-Oct-2009 Simon L. B. Nielsen <simon@FreeBSD.org>

MFC r197711:

Add no zero mapping feature.

NOTE: Unlike in the other branches where this change will be "merged"
to, the 'no zero mapping' is enabled by default in stable/8.

Errata: FreeBSD-EN-09:05.null
Approved by: re (kib)


# 878adb85 02-Oct-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Add a mitigation feature that will prevent user mappings at
virtual address 0, limiting the ability to convert a kernel
NULL pointer dereference into a privilege escalation attack.

If the sysctl is set to 0 a newly started process will not be able
to map anything in the address range of the first page (0 to PAGE_SIZE).
This is the default. Already running processes are not affected by this.

You can either change the sysctl or the tunable from loader in case
you need to map at a virtual address of 0, for example when running
any of the extinct species of a set of a.out binaries, vm86 emulation, ..
In that case set security.bsd.map_at_zero="1".

Superseeds: r197537
In collaboration with: jhb, kib, alc


# d4c8e5ac 12-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197031:
Unlock the image vnode around the call of pmc PMC_FN_PROCESS_EXEC hook.
The hook calls vn_fullpath(9), that should not be executed with a vnode
lock held.

Approved by: re (kensmith)


# 1ef6ea9b 09-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

Unlock the image vnode around the call of pmc PMC_FN_PROCESS_EXEC hook.
The hook calls vn_fullpath(9), that should not be executed with a vnode
lock held.

Reported by: Bruce Cran <bruce cran org uk>
Tested by: pho
MFC after: 3 days


# 87eca70e 31-Jul-2009 John Baldwin <jhb@FreeBSD.org>

Fix some LORs between vnode locks and filedescriptor table locks.
- Don't grab the filedesc lock just to read fd_cmask.
- Drop vnode locks earlier when mounting the root filesystem and before
sanitizing stdin/out/err file descriptors during execve().

Submitted by: kib
Approved by: re (rwatson)
MFC after: 1 week


# b146fc1b 28-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Rework vnode argument auditing to follow the same structure, in order
to avoid exposing ARG_ macros/flag values outside of the audit code in
order to name which one of two possible vnodes will be audited for a
system call.

Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 month


# 14961ba7 27-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# 838d9858 19-Jun-2009 Brooks Davis <brooks@FreeBSD.org>

Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.

Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.

Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867


# 0a2e596a 07-Jun-2009 Alan Cox <alc@FreeBSD.org>

Eliminate unnecessary obfuscation when testing a page's valid bits.


# d1a6e42d 06-Jun-2009 Alan Cox <alc@FreeBSD.org>

If vm_pager_get_pages() returns VM_PAGER_OK, then there is no need to check
the page's valid bits. The page is guaranteed to be fully valid. (For the
record, this is documented in vm/vm_pager.h's comments.)


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 3ff06357 16-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Supply AT_EXECPATH auxinfo entry to the interpreter, both for native and
compat32 binaries.

Tested by: pho
Reviewed by: kan


# 69c9eff8 26-Feb-2009 Ed Schouten <ed@FreeBSD.org>

Remove unneeded pointer `ndp'.

Inside do_execve(), we have a pointer `ndp', which always points to
`&nd'. I can imagine a primitive (non-optimizing) compiler to really
reserve space for such a pointer, so just remove the variable and use
`&nd' directly.


# c90c9021 26-Feb-2009 Ed Schouten <ed@FreeBSD.org>

Remove even more unneeded variable assignments.

kern_time.c:
- Unused variable `p'.

kern_thr.c:
- Variable `error' is always caught immediately, so no reason to
initialize it. There is no way that error != 0 at the end of
create_thread().

kern_sig.c:
- Unused variable `code'.

kern_synch.c:
- `rval' is always assigned in all different cases.

kern_rwlock.c:
- `v' is always overwritten with RW_UNLOCKED further on.

kern_malloc.c:
- `size' is always initialized with the proper value before being used.

kern_exit.c:
- `error' is always caught and returned immediately. abort2() never
returns a non-zero value.

kern_exec.c:
- `len' is always assigned inside the if-statement right below it.

tty_info.c:
- `td' is always overwritten by FOREACH_THREAD_IN_PROC().

Found by: LLVM's scan-build


# aeb32571 05-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.

Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by: ganbold
Suggested and reviewed by: jhb
MFC after: 2 week


# e506f34b 05-Nov-2008 Craig Rodrigues <rodrigc@FreeBSD.org>

Merge latest DTrace changes from Perforce.

Approved by: jb


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# dfa7fd1d 10-Sep-2008 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove VSVTX, VSGID and VSUID. This should be a no-op,
as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID.

Approved by: rwatson (mentor)


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 3f397884 25-Aug-2008 Robert Watson <rwatson@FreeBSD.org>

More fully audit fexecve(2) and its arguments.

Obtained from: TrustedBSD Project
Sponsored by: Google, Inc.


# 6356dba0 23-Aug-2008 Robert Watson <rwatson@FreeBSD.org>

Introduce two related changes to the TrustedBSD MAC Framework:

(1) Abstract interpreter vnode labeling in execve(2) and mac_execve(2)
so that the general exec code isn't aware of the details of
allocating, copying, and freeing labels, rather, simply passes in
a void pointer to start and stop functions that will be used by
the framework. This change will be MFC'd.

(2) Introduce a new flags field to the MAC_POLICY_SET(9) interface
allowing policies to declare which types of objects require label
allocation, initialization, and destruction, and define a set of
flags covering various supported object types (MPC_OBJECT_PROC,
MPC_OBJECT_VNODE, MPC_OBJECT_INPCB, ...). This change reduces the
overhead of compiling the MAC Framework into the kernel if policies
aren't loaded, or if policies require labels on only a small number
or even no object types. Each time a policy is loaded or unloaded,
we recalculate a mask of labeled object types across all policies
present in the system. Eliminate MAC_ALWAYS_LABEL_MBUF option as it
is no longer required.

MFC after: 1 week ((1) only)
Reviewed by: csjp
Obtained from: TrustedBSD Project
Sponsored by: Apple, Inc.


# ded7d39c 12-Aug-2008 Christian S.J. Peron <csjp@FreeBSD.org>

Reduce the scope of the vnode lock such that it does not cover
the various copyouts associated with initializing the process's
argv/env data in userspace. It is possible that these copyout
operations can fault under memory pressure, possibly resulting
in dead locks. This is believed to be safe since none of the
copyout_strings() operations need to interact with the vnode here.

Submitted by: Zhouyi Zhou
PR: kern/111260
Discussed with: kib
MFC after: 3 weeks


# 58e8af1b 25-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

Call pargs_drop() unconditionally in do_execve(), the function correctly
handles the NULL argument.
Make pargs_free() static.

MFC after: 1 week


# 9a75ea23 17-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

Pair the VOP_OPEN call from do_execve() with the reciprocal VOP_CLOSE.
This was unnoticed because local filesystems usually do nothing
non-trivial in the close vop.

Reported and tested by: Rick Macklem
MFC after: 2 weeks


# 5d217f17 24-May-2008 John Birrell <jb@FreeBSD.org>

Add DTrace 'proc' provider probes using the Statically Defined Trace
(sdt) mechanism.


# 632dbc19 30-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Implement the fexecve(2) syscall.

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# f8a47341 29-Dec-2007 Alan Cox <alc@FreeBSD.org>

Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)


# f231de47 03-Dec-2007 Konstantin Belousov <kib@FreeBSD.org>

Implement fetching of the __FreeBSD_version from the ELF ABI-tag note.
The value is read into the p_osrel member of the struct proc. p_osrel
is set to 0 for the binaries without the note.

MFC after: 3 days


# ca081fdb 13-Nov-2007 Julian Elischer <julian@FreeBSD.org>

Make sure there is a good default thread name for all threads.


# 89b57fcf 05-Nov-2007 Konstantin Belousov <kib@FreeBSD.org>

Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicing or dereferencing NULL.

As consequence, vmspace_exec() and vmspace_unshare() returns the errno
int. struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in the thread_alloc(),
that itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with: Peter Holm
Reviewed by: jhb


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 7bfda801 25-Sep-2007 Alan Cox <alc@FreeBSD.org>

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# 59d8f3ff 12-Jul-2007 John Baldwin <jhb@FreeBSD.org>

Fix a couple of issues with the stack limit for 32-bit processes on 64-bit
kernels exposed by the recent fixes to resource limits for 32-bit processes
on 64-bit kernels:
- Let ABIs expose their maximum stack size via a new pointer in sysentvec
and use that in preference to maxssiz during exec() rather than always
using maxssiz for all processses.
- Apply the ABI's limit fixup to the previous stack size when adjusting
RLIMIT_STACK to determine if the existing mapping for the stack needs to
be grown or shrunk (as well as how much it should be grown or shrunk).

Approved by: re (kensmith)


# ce0be646 13-Jun-2007 John Baldwin <jhb@FreeBSD.org>

Conditionally acquire Giant when dropping a reference on the ktrace vnode
during execve() when turning off tracing due to executing a setuid binary
as non-root. Previously this could fail to acquire Giant and fail an
assertion if the ktrace file was on a non-MPSAFE filesystem and the
executable was on an MPSAFE filesystem.

MFC after: 3 days
Reported by: kris


# 32f9753c 11-Jun-2007 Robert Watson <rwatson@FreeBSD.org>

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


# 9e223287 31-May-2007 Konstantin Belousov <kib@FreeBSD.org>

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


# 19059a13 14-May-2007 John Baldwin <jhb@FreeBSD.org>

Rework the support for ABIs to override resource limits (used by 32-bit
processes under 64-bit kernels). Previously, each 32-bit process overwrote
its resource limits at exec() time. The problem with this approach is that
the new limits affect all child processes of the 32-bit process, including
if the child process forks and execs a 64-bit process. To fix this, don't
ovewrite the resource limits during exec(). Instead, sv_fixlimits() is
now replaced with a different function sv_fixlimit() which asks the ABI to
sanitize a single resource limit. We then use this when querying and
setting resource limits. Thus, if a 32-bit process sets a limit, then
that new limit will be inherited by future children. However, if the
32-bit process doesn't change a limit, then a future 64-bit child will
see the "full" 64-bit limit rather than the 32-bit limit.

MFC is tentative since it will break the ABI of old linux.ko modules (no
other modules are affected).

MFC after: 1 week


# bd37fd72 25-Mar-2007 Kris Kennaway <kris@FreeBSD.org>

Update a comment: we usually call exec_vmspace_new with Giant not held,
but sometimes it is.


# 873fbcd7 05-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 0c14ff0e 04-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 2a53696f 22-Oct-2006 Alan Cox <alc@FreeBSD.org>

The page queues lock is no longer required by vm_page_busy() or
vm_page_wakeup(). Reduce or eliminate its use accordingly.


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 9af80719 21-Oct-2006 Alan Cox <alc@FreeBSD.org>

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# ae1078d6 01-Sep-2006 Wayne Salamon <wsalamon@FreeBSD.org>

Audit the argv and env vectors passed in on exec:
Add the argument auditing functions for argv and env.
Add kernel-specific versions of the tokenizer functions for the
arg and env represented as a char array.
Implement the AUDIT_ARGV and AUDIT_ARGE audit policy commands to
enable/disable argv/env auditing.
Call the argument auditing from the exec system calls.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)


# 993182e5 14-Aug-2006 Alexander Leidinger <netchild@FreeBSD.org>

- Change process_exec function handlers prototype to include struct
image_params arg.
- Change struct image_params to include struct sysentvec pointer and
initialize it.
- Change all consumers of process_exit/process_exec eventhandlers to
new prototypes (includes splitting up into distinct exec/exit functions).
- Add eventhandler to userret.

Sponsored by: Google SoC 2006
Submitted by: rdivacky
Parts suggested by: jhb (on hackers@)


# 4bb260ad 28-May-2006 Robert Watson <rwatson@FreeBSD.org>

In execve(), audit the path name being executed. In the future, it
would also be good to audit the interpreter pathname, if any.

Obtained from: TrustedBSD Project


# d302786c 05-May-2006 Tor Egge <tegge@FreeBSD.org>

Temporarily unlock vnode for new image being executed to avoid lock order
reversals that can lead to deadlocks. Normally vn_close(), namei() or vrele()
should not be called while holding vnode locks.


# b9eee07e 03-Apr-2006 Peter Wemm <peter@FreeBSD.org>

Remove the unused sva and eva arguments from pmap_remove_pages().


# 68ff3c24 08-Mar-2006 Stephan Uphoff <ups@FreeBSD.org>

Fix exec_map resource leaks.

Tested by: kris@


# 8917b8d2 06-Feb-2006 John Baldwin <jhb@FreeBSD.org>

- Always call exec_free_args() in kern_execve() instead of doing it in all
the callers if the exec either succeeds or fails early.
- Move the code to call exit1() if the exec fails after the vmspace is
gone to the bottom of kern_execve() to cut down on some code duplication.


# 68ce4375 02-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- textvp may have been from a different mountpoint than ndp->ni_vp and
we may need to acquire giant to vrele it.

Found by: mjacob
MFC After: 3 days


# 05406e6f 11-Dec-2005 Alan Cox <alc@FreeBSD.org>

Remove unneeded calls to pmap_remove_all(). The given page is not mapped.

Reviewed by: tegge


# d26b1a1f 08-Dec-2005 David Xu <davidxu@FreeBSD.org>

Register itimers_event_hook as a kernel event handler, so I don't
have to duplicate code to call it in exec() and exit1().


# 8ad398d0 06-Dec-2005 Alan Cox <alc@FreeBSD.org>

Reduce the scope of the page queues lock in exec_map_first_page(). The vm
object lock is sufficient for reading a page's PG_BUSY and busy flags.

MFC after: 1 week


# 6d7b314b 02-Nov-2005 David Xu <davidxu@FreeBSD.org>

Cleanup some signal interfaces. Now the tdsignal function accepts
both proc pointer and thread pointer, if thread pointer is NULL,
tdsignal automatically finds a thread, otherwise it sends signal
to given thread.
Add utility function psignal_event to send a realtime sigevent
to a process according to the delivery requirement specified in
struct sigevent.


# 1471f287 02-Nov-2005 Paul Saab <ps@FreeBSD.org>

Calling setrlimit from 32bit apps could potentially increase certain
limits beyond what should be capiable in a 32bit process, so we
must fixup the limits.

Reviewed by: jhb


# 60354683 22-Oct-2005 David Xu <davidxu@FreeBSD.org>

Make p_itimers as a pointer, so file sys/proc.h does not need to include
sys/timers.h.


# 86857b36 22-Oct-2005 David Xu <davidxu@FreeBSD.org>

Implement POSIX timers. Current only CLOCK_REALTIME and CLOCK_MONOTONIC
clock are supported. I have plan to merge XSI timer ITIMER_REAL and other
two CPU timers into the new code, current three slots are available for
the XSI timers.
The SIGEV_THREAD notification type is not supported yet because our
sigevent struct lacks of two member fields:
sigev_notify_function
sigev_notify_attributes
I have found the sigevent is used in AIO, so I won't add the two members
unless the AIO code is adjusted.


# 9104847f 13-Oct-2005 David Xu <davidxu@FreeBSD.org>

1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.

6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64


# 9f5c1d19 12-Oct-2005 Diomidis Spinellis <dds@FreeBSD.org>

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


# 34ea500b 03-Oct-2005 Don Lewis <truckman@FreeBSD.org>

Add missing word to comment.


# 33812c06 02-Oct-2005 Colin Percival <cperciva@FreeBSD.org>

If sufficiently bad things happen during a call to kern_execve(), it is
possible for do_execve() to call exit1() rather than returning. As a
result, the sequence "allocate memory; call kern_execve; free memory"
can end up leaking memory.

This commit documents this astonishing behaviour and adds a call to
exec_free_args() before the exit1() call in do_execve(). Since all
the users of kern_execve() in the tree use exec_free_args() to free
the command-line arguments after kern_execve() returns, this should
be safe, and it fixes the memory leak which can otherwise occur.

Submitted by: Peter Holm
MFC after: 3 days
Security: Local denial of service


# 5997cae9 01-Oct-2005 Don Lewis <truckman@FreeBSD.org>

Copy new process argument list in do_execve() before grabbing PROC_LOCK
to avoid touching pageable memory while holding a mutex.

Simplify argument list replacement logic.

PR: kern/84935
Submitted by: "Antoine Pelisse" apelisse AT gmail.com (in a different form)
MFC after: 3 days


# 15139246 30-Jun-2005 Joseph Koshy <jkoshy@FreeBSD.org>

MFP4:

- pmcstat(8) gprof output mode fixes:

lib/libpmc/pmclog.{c,h}, sys/sys/pmclog.h:
+ Add a 'is_usermode' field to the PMCLOG_PCSAMPLE event
+ Add an 'entryaddr' field to the PMCLOG_PROCEXEC event,
so that pmcstat(8) can determine where the runtime loader
/libexec/ld-elf.so.1 is getting loaded.

sys/kern/kern_exec.c:
+ Use a local struct to group the entry address of the image being
exec()'ed and the process credential changed flag to the exec
handling hook inside hwpmc(4).

usr.sbin/pmcstat/*:
+ Support "-k kernelpath", "-D sampledir".
+ Implement the ELF bits of 'gmon.out' profile generation in a new
file "pmcstat_log.c". Move all log related functions to this
file.
+ Move local definitions and prototypes to "pmcstat.h"

- Other bug fixes:
+ lib/libpmc/pmclog.c: correctly handle EOF in pmclog_read().
+ sys/dev/hwpmc_mod.c: unconditionally log a PROCEXIT event to all
attached PMCs when a process exits.
+ sys/sys/pmc.h: correct a function prototype.
+ Improve usage checks in pmcstat(8).

Approved by: re (blanket hwpmc)


# 4da0d332 23-Jun-2005 Peter Wemm <peter@FreeBSD.org>

Move HWPMC_HOOKS into its own opt_hwpmc_hooks.h file. It doesn't merit
being in opt_global.h and forcing a global recompile when only a few files
reference it.

Approved by: re


# f263522a 09-Jun-2005 Joseph Koshy <jkoshy@FreeBSD.org>

MFP4:

- Implement sampling modes and logging support in hwpmc(4).

- Separate MI and MD parts of hwpmc(4) and allow sharing of
PMC implementations across different architectures.
Add support for P4 (EMT64) style PMCs to the amd64 code.

- New pmcstat(8) options: -E (exit time counts) -W (counts
every context switch), -R (print log file).

- pmc(3) API changes, improve our ability to keep ABI compatibility
in the future. Add more 'alias' names for commonly used events.

- bug fixes & documentation.


# 6341095e 31-May-2005 Ken Smith <kensmith@FreeBSD.org>

This patch addresses a standards violation issue. The standards say a
file's access time should be updated when it gets executed. A while
ago the mechanism used to exec was changed to use a more mmap based
mechanism and this behavior was broken as a side-effect of that.

A new vnode flag is added that gets set when the file gets executed,
and the VOP_SETATTR() vnode operation gets called. The underlying
filesystem is expected to handle it based on its own semantics, some
filesystems don't support access time at all. Those that do should
handle it in a way that does not block, does not generate I/O if possible,
etc. In particular vn_start_write() has not been called. The UFS code
handles it the same way as it would normally handle the access time if
a file was read - the IN_ACCESS flag gets set in the inode but no other
action happens at this point. The actual time update will happen later
during a sync (which handles all the necessary locking).

Got me into this: cperciva
Discussed with: a lot with bde, a little with kan
Showed patches to: phk, jeffr, standards@, arch@
Minor discussion on: arch@


# 532c491b 03-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Initialize vfslocked correctly early enough for MAC to compile.
- Fix one place where we explicitly drop Giant!

Pointy hat to: me
Submitted by: Max Laier
Warned by: Tinderbox


# 269576c8 03-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- Use namei to acquire Giant for VFS if it is necessary. Drop the explicit
Giant acquisition.
- Remove GIANT_REQUIRED in the few remaining cases; the vm and vfs have
both been locked.


# 4d7d2993 30-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Return EACCES if we're trying to exec on a vp with no object.

Errno supplied by: cperciva


# 7625cbf3 27-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Pass the ISOPEN flag to namei so filesystems will know we're about to
open them or otherwise access the data.


# ebccf1e3 18-Apr-2005 Joseph Koshy <jkoshy@FreeBSD.org>

Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities
and documentation into -CURRENT.

Bump FreeBSD_version.

Reviewed by: alc, jhb (kernel changes)


# 90dc539b 25-Feb-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Welcome to the 21st century: increase MAXSHELLCMDLEN from 128 bytes to
PAGE_SIZE.

Unlike originator of the PR suggests retain MAXSHELLCMDLEN definition
(he has been proposing to replace it with PAGE_SIZE everywhere), not only
this reduced the diff significantly, but prevents code obfuscation and also
allows to increase/decrease this parameter easily if needed.

PR: kern/64196
Submitted by: Magnus Bäckström <b@etek.chalmers.se>


# 56c2262c 29-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Grrr, this committer needs to have a sleep. Remove lines from the previous
delta not intended for public consumption.

MFC after: 2 weeks


# c30af532 29-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

Fix small non-conformance introduced in the previous commit: execve() is
expected to return ENAMETOOLONG, not E2BIG if first argument doesn't
fit into {PATH_MAX} bytes.

MFC after: 2 weeks


# 610ecfe0 29-Jan-2005 Maxim Sobolev <sobomax@FreeBSD.org>

o Split out kernel part of execve(2) syscall into two parts: one that
copies arguments into the kernel space and one that operates
completely in the kernel space;

o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.

Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks


# 8516dd18 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# c113083c 14-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add new function fdunshare() which encapsulates the necessary light magic
for ensuring that a process' filedesc is not shared with anybody.

Use it in the two places which previously had private implmentations.

This collects all fd_refcnt handling in kern_descrip.c


# 6004362e 26-Nov-2004 David Schultz <das@FreeBSD.org>

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


# 124e4c3b 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.

Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.


# 598b7ec8 07-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Use more intuitive pointer for fdinit() and fdcopy().

Change fdcopy() to take unlocked filedesc.


# a7bc3102 11-Oct-2004 Peter Wemm <peter@FreeBSD.org>

Put on my peril sensitive sunglasses and add a flags field to the internal
sysctl routines and state. Add some code to use it for signalling the need
to downconvert a data structure to 32 bits on a 64 bit OS when requested by
a 32 bit app.

I tried to do this in a generic abi wrapper that intercepted the sysctl
oid's, or looked up the format string etc, but it was a real can of worms
that turned into a fragile mess before I even got it partially working.

With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have
it not abort. Things like netstat, ps, etc have a long way to go.

This also fixes a bug in the kern.ps_strings and kern.usrstack hacks.
These do matter very much because they are used by libc_r and other things.


# 84e0b075 07-Oct-2004 David Xu <davidxu@FreeBSD.org>

Add an execve command for kse_thr_interrupt to allow libpthread to
restore signal mask correctly, this is required by POSIX.

Reviewed by: deischen


# 906ac69d 05-Oct-2004 David Xu <davidxu@FreeBSD.org>

In original kern_execve() code, at the start of the function, it forces
all other threads to suicide, problem is execve() could be failed, and
a failed execve() would change threaded process to unthreaded, this side
effect is unexpected.
The new code introduces a new single threading mode SINGLE_BOUNDARY, in
the mode, all threads should suspend themself at user boundary except
the singler. we can not use SINGLE_NO_EXIT because we want to start from
a clean state if execve() is successful, suspending other threads at unknown
point and later resuming them from there and forcing them to exit at user
boundary may cause the process to start from a dirty state. If execve() is
successful, current thread upgrades to SINGLE_EXIT mode and forces other
threads to suicide at user boundary, otherwise, other threads will be resumed
and their interrupted syscall will be restarted.

Reviewed by: julian


# 63993cf0 23-Sep-2004 John Baldwin <jhb@FreeBSD.org>

- Don't try to unlock Giant if single threading fails since we don't have
it locked.
- Unlock Giant before calling exit1() since exit1() does not require Giant.


# 2e2e32b2 21-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Revert the last change..
Better to kill all other threads than to panic the system if 2 threads call
execve() at the same time. A better fix will be committed later.

Note that this only affects the case where the execve fails.


# 29780059 21-Sep-2004 Julian Elischer <julian@FreeBSD.org>

In a threaded process, don't kill off all the other threads until we have a
reasonable chance that the eceve() is going to succeeed. I.e.
wait until we've done the permission checks etc.

MFC after: 1 week


# ed062c8d 04-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 56f21b9d 26-Jul-2004 Colin Percival <cperciva@FreeBSD.org>

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


# aa3c8c02 23-Jul-2004 Julian Elischer <julian@FreeBSD.org>

White space fix..
diff reduction for upcoming commit.


# ce8da309 12-Jul-2004 Alan Cox <alc@FreeBSD.org>

Push down the acquisition and release of the page queues lock into
pmap_remove_pages(). (The implementation of pmap_remove_pages() is
optional. If pmap_remove_pages() is unimplemented, the acquisition and
release of the page queues lock is unnecessary.)

Remove spl calls from the alpha, arm, and ia64 pmap_remove_pages().


# aa0aa7a1 02-Jun-2004 Tim J. Robbins <tjr@FreeBSD.org>

Move TDF_SA from td_flags to td_pflags (and rename it accordingly)
so that it is no longer necessary to hold sched_lock while
manipulating it.

Reviewed by: davidxu


# 702ac0f1 21-May-2004 David Xu <davidxu@FreeBSD.org>

Clear KSE thread flags after KSE thread mode is ended. The side effect
of not clearing the flags for execv() syscall will result that a new
program runs in KSE thread mode without enabling it.

Submitted by: tjr
Modified by: davidxu


# 59c8bc40 22-Apr-2004 Alan Cox <alc@FreeBSD.org>

Utilize sf_buf_alloc() rather than pmap_qenter() (and sometimes
kmem_alloc_wait()) for mapping the image header. On all machines with a
direct virtual-to-physical mapping and SMP/HTT i386s, this is a clear win.


# 148b3f62 11-Apr-2004 Alan Cox <alc@FreeBSD.org>

Use vm_page_hold() rather than vm_page_wire() for short-duration page
wiring. The reason being that vm_page_hold() is cheaper.


# 2fc0588d 31-Mar-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove sysctl kern.ps_argsopen, it is not very useful, one should use
security.bsd.see_other_uids instead.

Discussed with: phk, rwatson


# a5bdcb2a 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Make the process_exit eventhandler run without Giant. Add Giant hooks
in the two consumers that need it.. processes using AIO and netncp.
Update docs. Say that process_exec is called with Giant, but not to
depend on it. All our consumers can handle it without Giant.


# 37814395 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Push Giant down a little further:
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() s not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe


# 7700eb86 12-Mar-2004 Ruslan Ermilov <ru@FreeBSD.org>

Do what the execve(2) manpage says and enforce what a Strictly
Conforming POSIX application should do by disallowing the argv
argument to be NULL.

PR: kern/33738
Submitted by: Marc Olzheim, Serge van den Boom
OK'ed by: nectar


# 8144e3b8 05-Mar-2004 John Baldwin <jhb@FreeBSD.org>

Lock Giant around the single threading code in exec() to satisfy an
assertion in the single threading code.


# df7c361e 17-Feb-2004 Peter Wemm <peter@FreeBSD.org>

Checkpoint a hack to enable running i386 libc_r binaries on a 64 bit
kernel. I'm not happy with it yet - refinements are to come.
This hack allows the kern.ps_strings and kern.usrstack sysctls to respond
to a 32 bit request, such as those coming from emulated i386 binaries.


# d6c847f3 27-Dec-2003 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs (mainly, try to always use explicit comparisons with
NULL when checking for null pointers).


# ca46e90e 27-Dec-2003 Bruce Evans <bde@FreeBSD.org>

Fixed some disordering in revs.1.194 and 1,196. Moved the exceve() syscall
function back to near the beginning of the file. Rev.1.194 moved it into
the middle of auxiliary functions following kern_execve(). Moved the
__mac_execve() syscall function up together with execve(). It was new in
rev1.1.196 and perfectly misplaced after execve().


# 34d26757 27-Dec-2003 Alan Cox <alc@FreeBSD.org>

Remove GIANT_REQUIRED from exec_unmap_first_page().


# eca8a663 11-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 9ee99eb4 20-Oct-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Remove md_bspstore from the MD fields of struct thread. Now that
the backing store is at a fixed address, there's no need for a
per-thread variable.


# bab1f052 19-Oct-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Put the RSE backing store at a fixed address. This change is triggered
by libguile that needs to know the base of the RSE backing store. We
currently do not export the fixed address to userland by means of a
sysctl so user code needs to hardcode it for now. This will be revisited
later.

The RSE backing store is now at the bottom of region 4. The memory stack
is at the top of region 4. This means that the whole region is usable
for the stacks, giving a 61-bit stack space.

Port: lang/guile (depended of x11/gnome2)


# 6ec2fca5 04-Oct-2003 Alan Cox <alc@FreeBSD.org>

Eliminate some unnecessary uses of the vm page queues lock around the
vm page's valid field. This field is being synchronized using the
containing vm object's lock.


# c31f2280 27-Sep-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Remove the regstkpages sysctl variable. We have a growable register
stack now.


# fd75d710 27-Sep-2003 Marcel Moolenaar <marcel@FreeBSD.org>

Part 2 of implementing rstacks: add the ability to create rstacks and
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.

Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. we need to include or cover the faulting
address.

Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.

Tested on: alpha, i386, ia64, sparc64


# c460ac3a 24-Sep-2003 Peter Wemm <peter@FreeBSD.org>

Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function. Export the clip/default values to
sysctl under the compat.ia32 heirarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable. This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'. And that causes exec etc to fail since it can no
longer find space to mmap things.


# a8d43c90 26-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add a "int fd" argument to VOP_OPEN() which in the future will
contain the filedescriptor number on opens from userland.

The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*

For now pass -1 all over the place.


# 0e2a4d3a 14-Jun-2003 David Xu <davidxu@FreeBSD.org>

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 8630c117 12-Jun-2003 Alan Cox <alc@FreeBSD.org>

Add vm object locking to various pagers' "get pages" methods, i386 stack
management functions, and a u area management function.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 06fa71cd 09-Jun-2003 Alan Cox <alc@FreeBSD.org>

Update the vm object and page locking in exec_map_first_page(). Mark the
one still anticipated change with XXX. Otherwise, this function is done.


# fd0cc9a8 08-Jun-2003 Alan Cox <alc@FreeBSD.org>

Lock the vm object when performing vm_page_grab().


# 90af4afa 13-May-2003 John Baldwin <jhb@FreeBSD.org>

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


# 2c10d16a 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Borrow the KSE single threading code for exec and exit. We use the check
if (p->p_numthreads > 1) and not a flag because action is only necessary
if there are other threads. The rest of the system has no need to
identify thr threaded processes.
- In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED
is not set.


# 75b8b3b2 24-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Replace the at_fork, at_exec, and at_exit functions with the slightly more
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.

Reviewed by: phk, jake, almost complete silence on arch@


# a5881ea5 13-Mar-2003 John Baldwin <jhb@FreeBSD.org>

- Cache a reference to the credential of the thread that starts a ktrace in
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.

Requested by: rwatson (1)


# ac2e4153 26-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 5215b187 16-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much the
KSE code.

Submitted by: davidxu


# 6f8132a8 31-Jan-2003 Julian Elischer <julian@FreeBSD.org>

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 0dbb100b 26-Jan-2003 David Xu <davidxu@FreeBSD.org>

Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.

Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.

KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.

When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# ec35c2af 20-Jan-2003 Robert Watson <rwatson@FreeBSD.org>

Perform VOP_GETATTR() before mac_check_vnode_exec() so that
the cached attributes are available to MAC modules.

Submitted by: mike halderman <mrh@nosc.mil>
Obtained from: TrustedBSD Project


# 3db161e0 13-Jan-2003 Matthew Dillon <dillon@FreeBSD.org>

It is possible for an active aio to prevent shared memory from being
dereferenced when a process exits due to the vmspace ref-count being
bumped. Change shmexit() and shmexit_myhook() to take a vmspace instead
of a process and call it in vmspace_dofree(). This way if it is missed
in exit1()'s early-resource-free it will still be caught when the zombie is
reaped.

Also fix a potential race in shmexit_myhook() by NULLing out
vmspace->vm_shm prior to calling shm_delete_mapping() and free().

MFC after: 7 days


# 45f603e2 06-Jan-2003 David Xu <davidxu@FreeBSD.org>

Clear some KSE fields after kse mode was turned off.


# 5dadd17b 04-Jan-2003 Jake Burkholder <jake@FreeBSD.org>

Add a sysctl to get the vm protections for the stack of the current process.
On architectures with a non-executable stack, eg sparc64, this is used by
libgcc to determine at runtime if its necessary to enable execute permissions
on a region of the stack which will be used to execute code, allowing the
call to mprotect to be avoided if the kernel is configured to map the stack
executable.


# c522c1bf 31-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

fdcopy() only needs a filedesc pointer.


# ee113343 18-Dec-2002 Alan Cox <alc@FreeBSD.org>

Hold the page queues lock when performing vm_page_busy().


# b80521fe 13-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

remove syscallarg().

Suggested by: peter


# d1989db5 26-Nov-2002 Robert Drehmel <robert@FreeBSD.org>

To avoid sleeping with all sorts of resources acquired (the reported
problem was a locked directory vnode), do not give the process a chance
to sleep in state "stopevent" (depends on the S_EXEC bit being set in
p_stops) until most resources have been released again.

Approved by: re


# 2d21129d 24-Nov-2002 Alan Cox <alc@FreeBSD.org>

Acquire and release the page queues lock around pmap_remove_pages() because
it updates several of vm_page's fields.


# a9a08882 17-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Release the imgp vnode prior to freeing exec_map resources to avoid
deadlock.


# 4fec79be 16-Nov-2002 Alan Cox <alc@FreeBSD.org>

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


# d154fb4f 10-Nov-2002 Alan Cox <alc@FreeBSD.org>

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


# 0c93266b 05-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Correct merge-o: disable the right execve() variation if !MAC


# 670cb89b 05-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Bring in two sets of changes:

(1) Permit userland applications to request a change of label atomic
with an execve() via mac_execve(). This is required for the
SEBSD port of SELinux/FLASK. Attempts to invoke this without
MAC compiled in result in ENOSYS, as with all other MAC system
calls. Complexity, if desired, is present in policy modules,
rather than the framework.

(2) Permit policies to have access to both the label of the vnode
being executed as well as the interpreter if it's a shell
script or related UNIX nonsense. Because we can't hold both
vnode locks at the same time, cache the interpreter label.
SEBSD relies on this because it supports secure transitioning
via shell script executables. Other policies might want to
take both labels into account during an integrity or
confidentiality decision at execve()-time.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# ccafe7eb 05-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Hook up the mac_will_execve_transition() and mac_execve_transition()
entrypoints, #ifdef MAC. The supporting logic already existed in
kern_mac.c, so no change there. This permits MAC policies to cause
a process label change as the result of executing a binary --
typically, as a result of executing a specially labeled binary.

For example, the SEBSD port of SELinux/FLASK uses this functionality
to implement TE type transitions on processes using transitioning
binaries, in a manner similar to setuid. Policies not implementing
a notion of transition (all the ones in the tree right now) require
no changes, since the old label data is copied to the new label
via mac_create_cred() even if a transition does occur.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 450ffb44 04-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Remove reference to struct execve_args from struct imgact, which
describes an image activation instance. Instead, make use of the
existing fname structure entry, and introduce two new entries,
userspace_argv, and userspace_envv. With the addition of
mac_execve(), this divorces the image structure from the specifics
of the execve() system call, removes a redundant pointer, etc.
No semantic change from current behavior, but it means that the
structure doesn't depend on syscalls.master-generated includes.

There seems to be some redundant initialization of imgact entries,
which I have maintained, but which could probably use some cleaning
up at some point.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# e1b1aa3b 11-Oct-2002 John Baldwin <jhb@FreeBSD.org>

- Move the 'done1' label down below the unlock of the proc lock and move
the locking of the proc lock after the goto to done1 to avoid locking
the lock in an error case just so we can turn around and unlock it.
- Move the exec_setregs() stuff out from under the proc lock and after
the p_args stuff. This allows exec_setregs() to be able to sleep or
write things out to userland, etc. which ia64 does.

Tested by: peter


# 05ba50f5 21-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.


# c1e2d386 14-Sep-2002 Nate Lawson <njl@FreeBSD.org>

Move setugidsafety() call outside of process lock. This prevents a lock
recursion when closef() calls pfind() which also wants the proc lock.
This case only occurred when setugidsafety() needed to close unsafe files.

Reviewed by: truckman


# 28b325aa 13-Sep-2002 Don Lewis <truckman@FreeBSD.org>

Drop the proc lock while calling fdcheckstd() which may block to allocate
memory.

Reviewed by: jhb


# 1279572a 05-Sep-2002 David Xu <davidxu@FreeBSD.org>

s/SGNL/SIG/
s/SNGL/SINGLE/
s/SNGLE/SINGLE/

Fix abbreviation for P_STOPPED_* etc flags, in original code they were
inconsistent and difficult to distinguish between them.

Approved by: julian (mentor)


# f36ba452 01-Sep-2002 Jake Burkholder <jake@FreeBSD.org>

Added fields for VM_MIN_ADDRESS, PS_STRINGS and stack protections to
sysentvec. Initialized all fields of all sysentvecs, which will allow
them to be used instead of constants in more places. Provided stack
fixup routines for emulations that previously used the default.


# bafbd492 29-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Renamed poorly named setregs to exec_setregs. Moved its prototype to
imgact.h with the other exec support functions.


# f3bec5d7 28-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Don't require that sysentvec.sv_szsigcode be non-NULL.


# 81f223ca 25-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Fixed most indentation bugs.


# ca0387ef 25-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Fixed placement of operators. Wrapped long lines.


# fd559a8a 24-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

Fixed white space around operators, casts and reserved words.

Reviewed by: md5


# a7cddfed 24-Aug-2002 Jake Burkholder <jake@FreeBSD.org>

return x; -> return (x);
return(x); -> return (x);

Reviewed by: md5


# 49539972 22-Aug-2002 Julian Elischer <julian@FreeBSD.org>

slight cleanup of single-threading code for KSE processes


# 619eb6e5 13-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Hold the vnode lock throughout execve.
- Set VV_TEXT in the top level execve code.
- Fixup the image activators to deal with the newly locked vnode.


# 339b79b9 01-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke an appropriate MAC entry point to authorize execution of
a file by a process. The check is placed slightly differently
than it appears in the trustedbsd_mac tree so that it prevents
a little more information leakage about the target of the execve()
operation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 89ab9307 30-Jul-2002 Jacques Vidrine <nectar@FreeBSD.org>

For processes which are set-user-ID or set-group-ID, the kernel performs a few
special actions for safety. One of these is to make sure that file descriptors
0..2 are in use, by opening /dev/null for those that are not already open.
Another is to close any file descriptors 0..2 that reference procfs. However,
these checks were made out of order, so that it was still possible for a
set-user-ID or set-group-ID process to be started with some of the file
descriptors 0..2 unused.

Submitted by: Georgi Guninski <guninski@guninski.com>


# d06c0d4d 27-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Slight restructuring of the logic for credential change case identification
during execve() to use a 'credential_changing' variable. This makes it
easier to have outstanding patchsets against this code, as well as to
add conditionally defined clauses.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 3ebc1248 19-Jul-2002 Peter Wemm <peter@FreeBSD.org>

Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable
handler in the kernel at the same time. Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment. This is a big help for execing i386 binaries
on ia64. The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.

Flesh out the i386 emulation support for ia64. At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.

Obtained from: dfr (mostly, many tweaks from me).


# b3afd20d 14-Jul-2002 Alan Cox <alc@FreeBSD.org>

In execve(), delay the acquisition of Giant until after kmem_alloc_wait().
(Operations on the exec_map don't require Giant.)


# 63c9e754 12-Jul-2002 John Baldwin <jhb@FreeBSD.org>

We don't need to clear oldcred here since newcred is not NULL yet.


# 97646f56 11-Jul-2002 Alan Cox <alc@FreeBSD.org>

o Lock accesses to the page queues.


# 0b2ed1ae 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Clean up execve locking:

- Grab the vnode object early in exec when we still have the vnode lock.
- Cache the object in the image_params.
- Make use of the cached object in imgact_*.c


# c781aea8 01-Jul-2002 Peter Wemm <peter@FreeBSD.org>

#include <sys/ktrace.h> would be useful too. (for ktrace_mtx)


# 1e9b3d91 01-Jul-2002 Peter Wemm <peter@FreeBSD.org>

Add #include "opt_ktrace.h"


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 7f05b035 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.


# 366838dd 25-Jun-2002 Alan Cox <alc@FreeBSD.org>

o Eliminate vmspace::vm_minsaddr. It's initialized but never used.
o Replace stale comments in vmspace by "const until freed" annotations
on some fields.


# 69be5db9 20-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

Don't leak resources if fdcheckstd() fails during exec.

Submitted by: Mike Makonnen <makonnen@pacbell.net>


# 1419eacb 19-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

Squish the "could sleep with process lock" messages caused by calling
uifind() with a proc lock held.

change_ruid() and change_euid() have been modified to take a uidinfo
structure which will be pre-allocated by callers, they will then
call uihold() on the uidinfo structure so that the caller's logic
is simplified.

This allows one to call uifind() before locking the proc struct and
thereby avoid a potential blocking allocation with the proc lock
held.

This may need revisiting, perhaps keeping a spare uidinfo allocated
per process to handle this situation or re-examining if the proc
lock needs to be held over the entire operation of changing real
or effective user id.

Submitted by: Don Lewis <dl-freebsd@catspoiler.org>


# 6c84de02 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

Properly lock accesses to p_tracep and p_traceflag. Also make a few
ktrace-only things #ifdef KTRACE that were not before.


# 9b3b1c5f 02-May-2002 John Baldwin <jhb@FreeBSD.org>

- Reorder execve() so that it performs blocking operations before it
locks the process.
- Defer other blocking operations such as vrele()'s until after we
release locks.
- execsigs() now requires the proc lock to be held when it is called
rather than locking the process internally.


# e983a376 18-Apr-2002 Jacques Vidrine <nectar@FreeBSD.org>

When exec'ing a set[ug]id program, make sure that the stdio file descriptors
(0, 1, 2) are allocated by opening /dev/null for any which are not already
open.

Reviewed by: alfred, phk
MFC after: 2 days


# 911fc923 04-Apr-2002 Peter Wemm <peter@FreeBSD.org>

Increase the size of the register stack storage on ia64 from 32K to 2MB so
that we can compile gcc. This is a hack because it adds a fixed 2MB to
each process's VSIZE regardless of how much is really being used since
there is no grow-up stack support. At least it isn't physical memory.
Sigh.

Add a sysctl to enable tweaking it for new processes.


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 5e20c11f 30-Mar-2002 Alan Cox <alc@FreeBSD.org>

Add a local proc *p in exec_new_vmspace() to avoid repeated dereferencing
to obtain it.


# 8899023f 27-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Make the reference counting of 'struct pargs' SMP safe.

There is still some locations where the PROC lock should be held
in order to prevent inconsistent views from outside (like the
proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be
fixed later.

Submitted by: Jonathan Mini <mini@haikugeek.com>


# cb100b25 26-Mar-2002 Alan Cox <alc@FreeBSD.org>

Remove an unnecessary and inconsistently used variable from exec_new_vmspace().


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# ac59490b 16-Mar-2002 Jake Burkholder <jake@FreeBSD.org>

Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/
pmap_qremove. pmap_kenter is not safe to use in MI code because it is not
guaranteed to flush the mapping from the tlb on all cpus. If the process
in question is preempted and migrates cpus between the call to pmap_kenter
and pmap_kremove, the original cpu will be left with stale mappings in its
tlb. This is currently not a problem for i386 because we do not use PG_G on
SMP, and thus all mappings are flushed from the tlb on context switches, not
just user mappings. This is not the case on all architectures, and if PG_G
is to be used with SMP on i386 it will be a problem. This was committed by
peter earlier as part of his fine grained tlb shootdown work for i386, which
was backed out for other reasons.

Reviewed by: peter


# 0cf3c909 27-Feb-2002 Warner Losh <imp@FreeBSD.org>

Remove now unused struct proc *p.

Approved by: jhb


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# d1693e17 27-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Back out all the pmap related stuff I've touched over the last few days.
There is some unresolved badness that has been eluding me, particularly
affecting uniprocessor kernels. Turning off PG_G helped (which is a bad
sign) but didn't solve it entirely. Userland programs still crashed.


# bd1e3a0f 26-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Jake further reduced IPI shootdowns on sparc64 in loops by using ranged
shootdowns in a couple of key places. Do the same for i386. This also
hides some physical addresses from higher levels and has it use the
generic vm_page_t's instead. This will help for PAE down the road.

Obtained from: jake (MI code, suggestions for MD part)


# 079b7bad 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 6f5dafea 13-Jan-2002 Alan Cox <alc@FreeBSD.org>

o Call the functions registered with at_exec() from exec_new_vmspace()
instead of execve(). Otherwise, the possibility still exists
for a pending AIO to modify the new address space.

Reviewed by: alfred


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# 21d56e9c 29-Dec-2001 Alfred Perlstein <alfred@FreeBSD.org>

Make AIO a loadable module.

Remove the explicit call to aio_proc_rundown() from exit1(), instead AIO
will use at_exit(9).

Add functions at_exec(9), rm_at_exec(9) which function nearly the
same as at_exec(9) and rm_at_exec(9), these functions are called
on behalf of modules at the time of execve(2) after the image
activator has run.

Use a modified version of tegge's suggestion via at_exec(9) to close
an exploitable race in AIO.

Fix SYSCALL_MODULE_HELPER such that it's archetecuterally neutral,
the problem was that one had to pass it a paramater indicating the
number of arguments which were actually the number of "int". Fix
it by using an inline version of the AS macro against the syscall
arguments. (AS should be available globally but we'll get to that
later.)

Add a primative system for dynamically adding kqueue ops, it's really
not as sophisticated as it should be, but I'll discuss with jlemon when
he's around.


# 91f91617 09-Dec-2001 David E. O'Brien <obrien@FreeBSD.org>

Repeat after me -- "Use of ANSI string concatenation can be bad."
In this case, C99's __func__ is properly defined as:

static const char __func__[] = "function-name";

and GCC 3.1 will not allow it to be used in bogus string concatenation.


# c3699b5f 07-Nov-2001 Peter Wemm <peter@FreeBSD.org>

For what its worth, sync up the type of ps_arg_cache_max (unsigned long)
with the sysctl type (signed long).


# 9ca45e81 27-Oct-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

Add a P_INEXEC flag that indicates that the process has called execve() and
it has not yet returned. Use this flag to deny debugging requests while
the process is execve()ing, and close once and for all any race conditions
that might occur between execve() and various debugging interfaces.

Reviewed by: jhb, rwatson


# 9a024fc5 24-Oct-2001 Robert Drehmel <robert@FreeBSD.org>

Use vm_offset_t instead of caddr_t to fix a warning and remove
two casts.


# 79deba82 23-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix ktrace enablement/disablement races that can result in a vnode
ref count panic.

Bug noticed by: ps
Reviewed by: ps
MFC after: 1 day


# cbc89bfb 10-Oct-2001 Paul Saab <ps@FreeBSD.org>

Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader
tunable.

Reviewed by: peter
MFC after: 2 weeks


# e913ca22 10-Oct-2001 Doug Rabson <dfr@FreeBSD.org>

Move setregs() out from under the PROC_LOCK so that it can use functions
list suword() which may trap.


# 8688bb93 09-Oct-2001 John Baldwin <jhb@FreeBSD.org>

proces -> process in a comment.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 116734c4 31-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Pushdown Giant for acct(), kqueue(), kevent(), execve(), fork(),
vfork(), rfork(), jail().


# b8c526df 22-Aug-2001 Alexander Langer <alex@FreeBSD.org>

Fix a simple typo I just happened to find.


# b2c3fa70 10-Jul-2001 Dima Dorfman <dd@FreeBSD.org>

Correct spelling in a comment and remove trailing newline from a
panic() call (panic() adds it itself).


# 333ea485 09-Jul-2001 Guido van Rooij <guido@FreeBSD.org>

Don't share sig handlers after an exec

Reviewed by: Alfred Perlstein


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# fbd26f75 20-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Fix some lock order reversals where we called free() while holding a proc
lock. We now use temporary variables to save the process argument pointer
and just update the pointer while holding the lock. We then perform the
free on the cached pointer after releasing the lock.


# b85db196 16-Jun-2001 Peter Wemm <peter@FreeBSD.org>

Move setugid() a little sooner to before we release tracing in case
crdup() or change_e*id() block on malloc() or mutex.


# 7cb8e4d2 26-May-2001 Robert Watson <rwatson@FreeBSD.org>

o pcred-removal changes included modifications to optimize the setting of
the saved uid and gid during execve(). Unfortunately, the optimizations
were incorrect in the case where the credential was updated, skipping
the setting of the saved uid and gid when new credentials were generated.
This change corrects that problem by handling the newcred!=NULL case
correctly.

Reported/tested by: David Malone <dwmalone@maths.tcd.ie>

Obtained from: TrustedBSD Project


# b1fc0ec1 25-May-2001 Robert Watson <rwatson@FreeBSD.org>

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


# d8aad40c 21-May-2001 John Baldwin <jhb@FreeBSD.org>

Axe unneeded spl()'s.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# e65897c3 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Proc locking.


# 1a6e52d0 06-Feb-2001 Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>

Fix typo: seperate -> separate.

Seperate does not exist in the english language.


# 98f03f90 23-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Protect proc.p_pptr and proc.p_children/p_sibling with the
proctree_lock.

linprocfs not locked pending response from informal maintainer.

Reviewed by: jhb, -smp@


# cf64863a 30-Nov-2000 Robert Watson <rwatson@FreeBSD.org>

o Add a comment to exec_check_permissions() to indicate that the
passed vnode must be locked; this is the case because of calls
to VOP_GETATTR(), VOP_ACCESS(), and VOP_OPEN(). This becomes
more of an issue when VOP_ACCESS() gets a bit more complicated,
which it does when you introduce ACL, Capability, and MAC
support.

Obtained from: TrustedBSD Project


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 63c47a5c 12-Oct-2000 Doug Rabson <dfr@FreeBSD.org>

Add a gross hack for ia64 to allocate the backing store for a new program.


# b9a22da4 25-Sep-2000 Takanori Watanabe <takawata@FreeBSD.org>

Make size of dynamic loader argument variable to support
various executable file format.

Reviewed by: peter


# eabc23ef 21-Sep-2000 Don Lewis <truckman@FreeBSD.org>

Remove unneeded #include that was a remnant of an earlier version of
my uidinfo patch.

Found by: phk


# 621dbe43 16-Sep-2000 Bruce Evans <bde@FreeBSD.org>

Added used include of <sys/mutex.h> (don't depend on pollution in
<sys/signalvar.h>).


# 9ff5ce6b 12-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# f535380c 05-Sep-2000 Don Lewis <truckman@FreeBSD.org>

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 9701cd40 05-Jul-2000 John Baldwin <jhb@FreeBSD.org>

Support for unsigned integer and long sysctl variables. Update the
SYSCTL_LONG macro to be consistent with other integer sysctl variables
and require an initial value instead of assuming 0. Update several
sysctl variables to use the unsigned types.

PR: 15251
Submitted by: Kelly Yancey <kbyanc@posi.net>


# 2c9b67a8 30-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded #include <vm/vm_zone.h>

Generated by: src/tools/tools/kerninclude


# d323ddf3 26-Apr-2000 Matthew Dillon <dillon@FreeBSD.org>

Fix #! script exec under linux emulation. If a script is exec'd from a
program running under linux emulation, the script binary is checked for
in /compat/linux first. Without this patch the wrong script binary
(i.e. the FreeBSD binary) will be run instead of the linux binary.
For example, #!/bin/sh, thus breaking out of linux compatibility mode.

This solves a number of problems people have had installing linux
software on FreeBSD boxes.


# ed6aff73 18-Apr-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unneeded <sys/buf.h> includes.

Due to some interesting cpp tricks in lockmgr, the LINT kernel shrinks
by 924 bytes.


# cb679c38 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce kqueue() and kevent(), a kernel event notification facility.


# 5e266442 20-Jan-2000 Warner Losh <imp@FreeBSD.org>

When we are execing a setugid program, and we have a procfs filesystem
file open in one of the special file descriptors (0, 1, or 2), close
it before completing the exec.

Submitted by: nergal@idea.avet.com.pl
Constructive comments: deraadt@openbsd.org, sef, peter, jkh


# 654f6be1 27-Dec-1999 Bruce Evans <bde@FreeBSD.org>

Changed the type used to represent the user stack pointer from `long *'
to `register_t *'. This fixes bugs like misplacement of argc and argv
on the user stack on i386's with 64-bit longs. We still use longs to
represent "words" like argc and argv, and assume that they are on the
stack (and that there is stack). The suword() and fuword() families
should also use register_t.


# 762e6b85 15-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Introduce NDFREE (and remove VOP_ABORTOP)


# a8704f89 26-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Add a sysctl to control if argv is disclosed to the world:
kern.ps_argsopen
It defaults to 1 which means that all users can see all argvs in ps(1).

Reviewed by: Warner


# b9df5231 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce commandline caching in the kernel.

This fixes some nasty procfs problems for SMP, makes ps(1) run much faster,
and makes ps(1) even less dependent on /proc which will aid chroot and
jails alike.

To disable this facility and revert to previous behaviour:
sysctl -w kern.ps_arg_cache_limit=0

For full details see the current@FreeBSD.org mail-archives.


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# fdf4e8b3 11-Aug-1999 Warner Losh <imp@FreeBSD.org>

Stop profiling on exec.

Obtained from: NetBSD


# f711d546 27-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Suser() simplification:

1:
s/suser/suser_xxx/

2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.

3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/

The remaining suser_xxx() calls will be scrutinized and dealt with
later.

There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.

More changes to the suser() API will come along with the "jail" code.


# db42d908 19-Apr-1999 Peter Wemm <peter@FreeBSD.org>

unifdef -DVM_STACK - it's been on for a while for x86 and was checked
and appeared to be working for the Alpha some time ago.


# 4fe88fe6 03-Apr-1999 John Polstra <jdp@FreeBSD.org>

Restore support for executing BSD/OS binaries on the i386 by passing
the address of the ps_strings structure to the process via %ebx.
For other kinds of binaries, %ebx is still zeroed as before.

Submitted by: Thomas Stephens <tas@stephens.org>
Reviewed by: jdp


# b1028ad1 19-Feb-1999 Luoqi Chen <luoqi@FreeBSD.org>

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 2267af78 06-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Add (but don't activate) code for a special VM option to make
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.

The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.

Submitted by: Richard Seaman <dick@tar.com>


# 9c0fed3d 30-Dec-1998 Doug Rabson <dfr@FreeBSD.org>

Various changes to support OSF1 emulation:

* Move the user stack from VM_MAXUSER_ADDRESS to a place below the 32bit
boundary (needed to support 32bit OSF programs). This should also save
one pagetable per process.
* Add cvtqlsv to the set of instructions handled by the floating point
software completion code.
* Disable all floating point exceptions by default.
* A minor change to execve to allow the OSF1 image activator to support
dynamic loading.


# 486bddb0 27-Dec-1998 Doug Rabson <dfr@FreeBSD.org>

Fix some 64bit truncation problems which crept into SYSCTL_LONG() with the
last cleanup. Since the oid_arg2 field of struct sysctl_oid is not wide
enough to hold a long, the SYSCTL_LONG() macro has been modified to only
support exporting long variables by pointer instead of by value.

Reviewed by: bde


# 4c56fcde 16-Dec-1998 Bruce Evans <bde@FreeBSD.org>

Removed the cast to a pointer in the definition of PS_STRINGS and
adjusted related casts to match (only in the kernel in this commit).
The pointer was only wanted in one place in kern_exec.c. Applications
should use the kern.ps_strings sysctl instead of PS_STRINGS, so they
shouldn't notice this change.


# 2caeccee 16-Dec-1998 Bruce Evans <bde@FreeBSD.org>

Removed all traces of SYSCTL_INTPTR(). Pointers can't really be passed
across the kernel -> application interface, and for the one sysctl where
they were passed and actually used (kern.ps_strings), the applications
want addresses represented as u_longs anyway (the other sysctl that
passed them, kern.usrstack, has never been used).

Agreed to by: dfr, phk


# 73007561 28-Oct-1998 David Greenman <dg@FreeBSD.org>

Added a second argument, "activate" to the vm_page_unwire() call so that
the caller can select either inactive or active queue to put the page on.


# aa855a59 15-Oct-1998 Peter Wemm <peter@FreeBSD.org>

*gulp*. Jordan specifically OK'ed this..

This is the bulk of the support for doing kld modules. Two linker_sets
were replaced by SYSINIT()'s. VFS's and exec handlers are self registered.
kld is now a superset of lkm. I have converted most of them, they will
follow as a seperate commit as samples.
This all still works as a static a.out kernel using LKM's.


# e69763a3 04-Sep-1998 Doug Rabson <dfr@FreeBSD.org>

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 069e9bc1 24-Aug-1998 Doug Rabson <dfr@FreeBSD.org>

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptable.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# aae0aa45 15-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Cast between longs and pointers via intptr_t. The results of fuword()
should be checked before casting. The results of suword() should be
checked.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# dc733423 17-Apr-1998 Dag-Erling Smørgrav <des@FreeBSD.org>

Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.


# eed2412e 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

Free the first page also if it is not valid.


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# c8a79999 01-Mar-1998 Peter Wemm <peter@FreeBSD.org>

Update the ELF image activator to use some of the exec resources rather
than rolling it's own. This means that it now uses the "safe"
exec_map_first_page() to get the ld.so headers rather than risking a panic
on a page fault failure (eg: NFS server goes down).
Since all the ELF tools go to a lot of trouble to make sure everything
lives in the first page for executables, this is a win. I have not seen
any ELF executable on any system where all the headers didn't fit in the
first page with lots of room to spare.
I have been running variations of this code for some time on my pure ELF
systems.


# 5132080e 25-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 95461b45 04-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 1616db3c 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Implement the first page access for object type determination more
VM clean. Also, use vm_map_insert instead of vm_mmap.
Reviewed by: dg@freebsd.org


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 675ea6f0 26-Dec-1997 Bruce Evans <bde@FreeBSD.org>

Unspammed nested include of <vm/vm_zone.h>.


# d5f81602 19-Dec-1997 Sean Eric Fagan <sef@FreeBSD.org>

Clear the p_stops field on change of user/group id, unless the correct
flag is set in the p_pfsflags field. This, essentially, prevents an SUID
proram from hanging after being traced. (E.g., "truss /usr/bin/rlogin" would
fail, but leave rlogin in a stopevent state.) Yet another case where procctl
is (hopefully ;)) no longer needed in the general case.

Reviewed by: bde (thanks bruce :))


# c7ce9e26 16-Dec-1997 David Greenman <dg@FreeBSD.org>

Fix bug where a struct buf was free()'d back to the system malloc pool.
Quite amazing that the system runs at all with this bug. Also present in
2.2.5. The bug appears to have come in with changes in rev 1.53.

PR: might fix PR#5313
Submitted by: bde


# 2a024a2b 05-Dec-1997 Sean Eric Fagan <sef@FreeBSD.org>

Changes to allow event-based process monitoring and control.


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# d021ae3d 15-Oct-1997 Guido van Rooij <guido@FreeBSD.org>

On execing a sgid program, do not set P_SUGID when cr_gid and cr)_uid
do not change.
PR: 4755
Reviewed by: Bruce Evans


# 99448ed1 20-Sep-1997 John Dyson <dyson@FreeBSD.org>

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# a78e8d2a 03-Aug-1997 David Greenman <dg@FreeBSD.org>

Fixed security hole with sharing the file descriptor table (via rfork)
when execing a setuid/setgid binary. Code submitted by Sean Eric Fagan
(sef@freebsd.org).
Also consolidated the setuid/setgid checks into one place.
Reviewed by: dyson,sef


# 5cf3d12c 23-Apr-1997 Andrey A. Chernov <ache@FreeBSD.org>

Don't clobber user space argv0 memory on shell exec, mainly for vfork()
Fix another bug: if argv[0] is NULL, garbadge args might be added for
shell script
Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no> (with yet one fault detect from me)


# 1ebd0c59 17-Apr-1997 David Greenman <dg@FreeBSD.org>

Brought fix from the 2.2 branch forward (see rev 1.47.2.7): serious bugs
with reading the image header.


# 492da96c 12-Apr-1997 John Dyson <dyson@FreeBSD.org>

Correct the previous thread-fix commit. I made a clerical error.


# 5856e12e 12-Apr-1997 John Dyson <dyson@FreeBSD.org>

Fully implement vfork. Vfork is now much much faster than even our
fork. (On my machine, fork is about 240usecs, vfork is 78usecs.)

Implement rfork(!RFPROC !RFMEM), which allows a thread to divorce its memory
from the other threads of a group.

Implement rfork(!RFPROC RFCFDG), which closes all file descriptors, eliminating
possible existing shares with other threads/processes.

Implement rfork(!RFPROC RFFDG), which divorces the file descriptors for a
thread from the rest of the group.

Fix the case where a thread does an exec. It is almost nonsense for a thread
to modify the other threads address space by an exec, so we
now automatically divorce the address space before modifying it.


# c04b956c 11-Apr-1997 John Dyson <dyson@FreeBSD.org>

Effectively remove the previous commit to fix threads forking. The
change was a false-start, and needs more work.


# af9ec885 11-Apr-1997 John Dyson <dyson@FreeBSD.org>

Allow a kernel-supported process thread to do an exec without blasting
away the VM space of all of the other, associated threads.


# 66141753 04-Apr-1997 David Greenman <dg@FreeBSD.org>

Killed unnecessary vp == NULL check after namei.


# a3cf6eba 04-Apr-1997 David Greenman <dg@FreeBSD.org>

Oops, only free component name buffer if namei() didn't. This bug has
been in here since I wrote the code 3 years ago! Thanks, Bruce!

Submitted by: bde


# 6d5a0a8c 03-Apr-1997 David Greenman <dg@FreeBSD.org>

Various fixes:

1. imgp->image_header needs to be cleared for the bp == NULL && `goto
interpret' case, else exec_fail_dealloc would free it twice after
an error.

2. Moved the vp->v_writecount check in exec_check_permissions() to
near the end. This fixes execve("/dev/null", ...) returning the
bogus errno ETXTBSY. ETXTBSY is still returned for attempts to
exec interpreted files that are open for writing. The man page
is very old and wrong here. It says that ETXTBSY is for pure
procedure (shared text) files that are open for writing or reading.

3. Moved the setuid disabling in exec_check_permissions() to the end.
Cosmetic. It's more natural to dispose of all the error cases
first.

...plus a couple of other cosmetic changes.

Submitted by: bde


# 8677f509 03-Apr-1997 David Greenman <dg@FreeBSD.org>

Lose the vnode lock on a permissions failure.

Submitted by: Tor Egge <Tor.Egge@idi.ntnu.no>


# 9caaadb6 31-Mar-1997 David Greenman <dg@FreeBSD.org>

Changed the way that the exec image header is read to be filesystem-
centric rather than VM-centric to fix a problem with errors not being
detectable when the header is read.
Killed exech_map as a result of these changes.
There appears to be no performance difference with this change.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# e47bda07 18-Feb-1997 David Greenman <dg@FreeBSD.org>

Fix from PR #2757:

execve() clears the P_SUGID process flag in execve() if the binary
executed does not have suid or sgid permission bits set.

This also happens when the effective uid is different from the real
uid or the effective gid is different from the real gid. Under
these circumstances, the process still has set id privileges and
the P_SUGID flag should not be cleared.

Submitted by: Tor Egge <Tor.Egge@idt.ntnu.no>


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 2cb544c3 08-Nov-1996 John Dyson <dyson@FreeBSD.org>

Fix an ordering bug -- pmap_remove_pages should be called BEFORE
vm_map_remove, not after...

2.2-RELEASE candidate.


# 9d3fbbb5 12-Oct-1996 John Dyson <dyson@FreeBSD.org>

Performance optimizations. One of which was meant to go in before the
previous snap. Specifically, kern_exit and kern_exec now makes a
call into the pmap module to do a very fast removal of pages from the
address space. Additionally, the pmap module now updates the PG_MAPPED
and PG_WRITABLE flags. This is an optional optimization, but helpful
on the X86.


# 67bf6868 29-Jul-1996 John Dyson <dyson@FreeBSD.org>

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 4f4d35ed 26-Jul-1996 John Dyson <dyson@FreeBSD.org>

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# 6ab46d52 11-Jul-1996 Bruce Evans <bde@FreeBSD.org>

Don't use NULL in non-pointer contexts.


# 86064318 02-Jun-1996 David Greenman <dg@FreeBSD.org>

Use kmem_alloc_wait/kmem_free_wakeup() to avoid allocation failures
from running out of string space in the exec_map.


# 6120fef1 02-Jun-1996 David Greenman <dg@FreeBSD.org>

Fix declaration of ps_strings.


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# a794e791 30-Apr-1996 Bruce Evans <bde@FreeBSD.org>

Removed unnecessary #includes from <sys/imgact.h> so that it is
self-sufficient and added explicit #includes where required.


# 24b34f09 29-Apr-1996 Sujal Patel <smpatel@FreeBSD.org>

Fixed two typos in the comment.

Pointed out by: davidg


# 39f70d45 07-Apr-1996 David Greenman <dg@FreeBSD.org>

Killed sections 3 and 4 of my copyright as I don't agree with it (I believe
it to be unnecessarily restrictive). For tty_subr.c, update to my standard
copyright.


# e1743d02 10-Mar-1996 Søren Schmidt <sos@FreeBSD.org>

First attempt at FreeBSD & Linux ELF support.

Compile and link a new kernel, that will give native ELF support, and
provide the hooks for other ELF interpreters as well.

To make native ELF binaries use John Polstras elf-kit-1.0.1..
For the time being also use his ld-elf.so.1 and put it in
/usr/libexec.

The Linux emulator has been enhanced to also run ELF binaries, it
is however in its very first incarnation.
Just get some Linux ELF libs (Slackware-3.0) and put them in the
prober place (/compat/linux/...).
I've ben able to run all the Slackware-3.0 binaries I've tried
so far.
(No it won't run quake yet :)


# d66a5066 02-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Mega-commit for Linux emulator update.. This has been stress tested under
netscape-2.0 for Linux running all the Java stuff. The scrollbars are now
working, at least on my machine. (whew! :-)

I'm uncomfortable with the size of this commit, but it's too
inter-dependant to easily seperate out.

The main changes:

COMPAT_LINUX is *GONE*. Most of the code has been moved out of the i386
machine dependent section into the linux emulator itself. The int 0x80
syscall code was almost identical to the lcall 7,0 code and a minor tweak
allows them to both be used with the same C code. All kernels can now
just modload the lkm and it'll DTRT without having to rebuild the kernel
first. Like IBCS2, you can statically compile it in with "options LINUX".

A pile of new syscalls implemented, including getdents(), llseek(),
readv(), writev(), msync(), personality(). The Linux-ELF libraries want
to use some of these.

linux_select() now obeys Linux semantics, ie: returns the time remaining
of the timeout value rather than leaving it the original value.

Quite a few bugs removed, including incorrect arguments being used in
syscalls.. eg: mixups between passing the sigset as an int, vs passing
it as a pointer and doing a copyin(), missing return values, unhandled
cases, SIOC* ioctls, etc.

The build for the code has changed. i386/conf/files now knows how
to build linux_genassym and generate linux_assym.h on the fly.

Supporting changes elsewhere in the kernel:

The user-mode signal trampoline has moved from the U area to immediately
below the top of the stack (below PS_STRINGS). This allows the different
binary emulations to have their own signal trampoline code (which gets rid
of the hardwired syscall 103 (sigreturn on BSD, syslog on Linux)) and so
that the emulator can provide the exact "struct sigcontext *" argument to
the program's signal handlers.

The sigstack's "ss_flags" now uses SS_DISABLE and SS_ONSTACK flags, which
have the same values as the re-used SA_DISABLE and SA_ONSTACK which are
intended for sigaction only. This enables the support of a SA_RESETHAND
flag to sigaction to implement the gross SYSV and Linux SA_ONESHOT signal
semantics where the signal handler is reset when it's triggered.

makesyscalls.sh no longer appends the struct sysentvec on the end of the
generated init_sysent.c code. It's a lot saner to have it in a seperate
file rather than trying to update the structure inside the awk script. :-)

At exec time, the dozen bytes or so of signal trampoline code are copied
to the top of the user's stack, rather than obtaining the trampoline code
the old way by getting a clone of the parent's user area. This allows
Linux and native binaries to freely exec each other without getting
trampolines mixed up.


# 99ac3bc8 24-Feb-1996 Peter Wemm <peter@FreeBSD.org>

Add two sysctl variables that can be read by libutil and libkvm so that
they can adapt to simple kernel VM layout changes.


# 9f29a577 20-Jan-1996 Bruce Evans <bde@FreeBSD.org>

Removed stale #includes of "opt_sysvipc.h".


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 81090119 07-Jan-1996 Peter Wemm <peter@FreeBSD.org>

(gulp!) reran makesyscalls..

sysv_ipc.c: add stub functions that either simply return (for the hooks
in kern_fork/kern_exit) or log() a messgae and call enosys() (for the
syscalls). sysv_ipc.c will become "standard" in conf/files and has
#ifs for all the permutations.


# 50c73f36 04-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert SYSV IPC to new-style options. (I hope I got everything...)
The LKMs will need an extra file, to come later.


# 87b6de2b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# 1ed012f9 08-Dec-1995 Peter Wemm <peter@FreeBSD.org>

Reorganise ps_strings in order to gain BSD/OS 2.0 binary compatability.
This is now in line with NetBSD as well..

Note that once this series of commits is finished, you must recompile
libkvm, then ps and maybe 'w'. If you are running the recently imported
sendmail-8.7, you should recompile that too (src/conf.c at least).


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# c2f9f36b 13-Nov-1995 David Greenman <dg@FreeBSD.org>

Use kmem_alloc_pageable/kmem_free to allocate memory instead of individual
VM map functions.


# d2d3e875 11-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


# c52007c2 05-Nov-1995 David Greenman <dg@FreeBSD.org>

All:
Changed vnodep -> vp for consistency with the rest of the kernel, and
changed iparams -> imgp for brevity.

kern_exec.c:
Explicitly initialized some additional parts of the image_params struct
to avoid bzeroing it. Rewrote the set-id code to reduce the number of
logical tests. The rewrite exposed a mostly benign bug in the algorithm:
traced set-id images would get ktracing disabled even if the set-id didn't
happen for other reasons.


# 079cc25b 21-Oct-1995 David Greenman <dg@FreeBSD.org>

Killed a few gratuitous #include's.


# ad7507e2 07-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Remove prototype definitions from <sys/systm.h>.
Prototypes are located in <sys/sysproto.h>.

Add appropriate #include <sys/sysproto.h> to files that needed
protos from systm.h.

Add structure definitions to appropriate files that relied on sys/systm.h,
right before system call definition, as in the rest of the kernel source.

In kern_prot.c, instead of using the dummy structure "args", create
individual dummy structures named <syscall>_args. This makes
life easier for prototype generation.


# c0e5de7d 24-Aug-1995 David Greenman <dg@FreeBSD.org>

Moved setting of VTEXT flag into the appropriate image activators. This
fixes a bug where linux binaries would get the flag set inappropriately.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 3ed8a403 24-Mar-1995 David Greenman <dg@FreeBSD.org>

Use 'p' rather than 'curproc' when appropriate.


# b35ba931 24-Mar-1995 David Greenman <dg@FreeBSD.org>

Use NDINIT macro to initialize fields for namei.


# 9022e4ad 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed bug introduced in the previous commit - the lock must be held until
after the call to exec_check_permissions().


# f5277ae7 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Lose the lock on the vnode. Changes to implement proper locking in the
vnode pager now require this.

Submitted by: John Dyson


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# a74a73d9 10-Mar-1995 David Greenman <dg@FreeBSD.org>

Removed some #include's of unnecessary include files.


# 3aed948b 28-Feb-1995 David Greenman <dg@FreeBSD.org>

No longer assume that a process's address space can be directly written to.


# 68940ac1 20-Feb-1995 David Greenman <dg@FreeBSD.org>

Use of vm_allocate() and vm_deallocate() has been deprecated.


# 1e1e0b44 14-Feb-1995 Søren Schmidt <sos@FreeBSD.org>

First attempt to run linux binaries. This is only the changes needed to
the generic kernel. The actual emulator is a separate LKM. (not finished
yet, sorry).
Submitted by: sos@freebsd.org & sef@kithrup.com


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 797f2d22 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# bb56ec4a 25-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 2a531c80 24-Sep-1994 David Greenman <dg@FreeBSD.org>

Added support for p_textvp which stores the vnode pointer of the execed binary.


# d74643df 13-Sep-1994 David Greenman <dg@FreeBSD.org>

Added missing #ifdef SYSVSHM.


# 3d903220 13-Sep-1994 Doug Rabson <dfr@FreeBSD.org>

Added SYSV ipcs.

Obtained from: NetBSD and FreeBSD-1.1.5


# 0fd1a014 24-Aug-1994 David Greenman <dg@FreeBSD.org>

Pay attention to *all* errors from copyinstr(). This patch fixes a bug
that causes a no-panic instant reboot when bogus argv/envvs are fed to
execve().


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 93f6448c 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Implemented support for the "ps_strings" structure (grrrr...) for use in
the userland library libkvm.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources