History log of /openbsd-current/sys/kern/kern_exit.c
Revision Date Author Comments
# 1.222 03-Jun-2024 claudio

Remove the now unused s argument to SCHED_LOCK and SCHED_UNLOCK.

The SPL level is now tracked by the mutex and we no longer need to track
this in the callers.
OK miod@ mlarkin@ tb@ jca@


# 1.221 20-May-2024 claudio

Rework interaction between sleep API and exit1() and start unlocking ps_threads

This diff adjusts how single_thread_set() accounts the threads by using
ps_threadcnt as initial value and counting all threads out that are already
parked. In single_thread_check() call exit1() before decreasing
ps_singlecount; this is now done in exit1().

exit1() and thread_fork() ensure that ps_threadcnt is updated with the
pr->ps_mtx held and in exit1() also account for exiting threads since
exit1() can sleep.

OK mpi@


Revision tags: OPENBSD_7_5_BASE
# 1.220 19-Jan-2024 bluhm

Backout priterator() for walking allprocess list.

This approach does not work as LIST_NEXT() of a removed element
does not return NULL. It causes a crash in syzkaller and triggers
kernel diagnostic assertion "vp->v_uvcount == 0" in sys/kern/kern_unveil.c
line 845 during reboot. Unfortunately the backout brings back the
race in fill_file() and fstat(1) may crash the kernel.

Reported-by: syzbot+54fba1c004d7383d5e85@syzkaller.appspotmail.com
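
The failure mode described above can be shown outside the kernel. The sketch
below is a hand-rolled list that mimics how LIST_REMOVE() unlinks an element;
struct node and the node_* helpers are hypothetical names for illustration
only, not kernel code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a list element; not the kernel's struct process. */
struct node {
	struct node *next;	/* forward link, like le_next */
	struct node **prevp;	/* address of the previous next pointer, like le_prev */
	int id;
};

/* Insert elm at the head of the list, in the manner of LIST_INSERT_HEAD(). */
void
node_insert_head(struct node **head, struct node *elm)
{
	elm->next = *head;
	if (*head != NULL)
		(*head)->prevp = &elm->next;
	*head = elm;
	elm->prevp = head;
}

/* Unlink elm in the manner of LIST_REMOVE(): elm's own links are not reset. */
void
node_remove(struct node *elm)
{
	if (elm->next != NULL)
		elm->next->prevp = elm->prevp;
	*elm->prevp = elm->next;
}

/* Count the elements that are still reachable from head. */
int
node_count(struct node *head)
{
	int n = 0;

	for (struct node *np = head; np != NULL; np = np->next)
		n++;
	return n;
}
```

After removing b from a list a, b, c, node_count() reports two elements, yet
b.next still points at c. An iterator parked on b would therefore keep walking
from a stale link instead of seeing NULL.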


# 1.219 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people


# 1.218 15-Jan-2024 mvs

Introduce priterator(), the `ps_list' iterator. Some of `allprocess'
list walkthroughs have context switch within, so make exit1() wait
until the last reference released.

Reported-by: syzbot+0e9dda76c42c82c626d7@syzkaller.appspotmail.com

ok bluhm claudio


Revision tags: OPENBSD_7_4_BASE
# 1.217 29-Sep-2023 claudio

Extend single_thread_set() mode with additional flag attributes.

The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter
the behaviour of single_thread_set(). This allows explicit control
of the SINGLE_DEEP behaviour.

If SINGLE_DEEP is set the deep flag is passed to the initial check call
and by that the check will error out instead of suspending (SINGLE_UNWIND)
or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to
single_thread_set() outside of userret. E.g. at the start of sys_execve
because the proc is not allowed to call exit1() in that location.

SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore
returns BEFORE all threads have been parked. Currently this is only used by
the ptrace code and should not be used anywhere else. Not waiting for all
threads to settle is asking for trouble.

This solves an issue by using SINGLE_UNWIND in the coredump case where
the code should actually exit in case another thread crashed moments earlier.
Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since
the call to pledge_fail() is for sure not at the kernel boundary.

OK mpi@
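
A mode that can be or-ed with flag attributes is usually decoded with a mask.
The sketch below shows that general technique only; the EX_* names, their
numeric values and struct single_request are assumptions made for this
example and are not taken from the kernel headers.

```c
#include <stdbool.h>

/* Hypothetical values; the real constants are defined by the kernel. */
#define EX_SINGLE_UNWIND	0x01	/* base mode */
#define EX_SINGLE_EXIT		0x02	/* base mode */
#define EX_SINGLE_SUSPEND	0x03	/* base mode */
#define EX_SINGLE_MASK		0x0f	/* bits reserved for the base mode */
#define EX_SINGLE_DEEP		0x10	/* flag attribute */
#define EX_SINGLE_NOWAIT	0x20	/* flag attribute */

/* Decoded form of a combined mode argument. */
struct single_request {
	int mode;	/* one of the base modes */
	bool deep;	/* error out instead of suspending or exiting */
	bool nowait;	/* return before all threads have parked */
};

/* Split a combined argument into its base mode and flag attributes. */
struct single_request
decode_single_mode(int flags)
{
	struct single_request req;

	req.mode = flags & EX_SINGLE_MASK;
	req.deep = (flags & EX_SINGLE_DEEP) != 0;
	req.nowait = (flags & EX_SINGLE_NOWAIT) != 0;
	return req;
}
```

With these assumed values, decode_single_mode(EX_SINGLE_UNWIND |
EX_SINGLE_DEEP) yields the unwind mode with deep set and nowait clear.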


# 1.216 21-Sep-2023 claudio

Move code inside exit1() to better spots.

- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.

With and OK jca@


# 1.215 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@
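
The count-up scheme can be modelled in user space. This is a minimal sketch,
assuming a pthread mutex in place of the process mutex; struct process_model
and the model_* functions are hypothetical, and the model does not say which
threads the kernel actually includes in either counter.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model; field names only echo the commit message. */
struct process_model {
	pthread_mutex_t mtx;	/* stands in for the per-process mutex */
	int threadcnt;		/* threads that have to park */
	int singlecnt;		/* threads parked so far, counts up */
};

/* Set up the model for a process with nthreads threads to park. */
void
model_init(struct process_model *pr, int nthreads)
{
	pthread_mutex_init(&pr->mtx, NULL);
	pr->threadcnt = nthreads;
	pr->singlecnt = 0;
}

/*
 * Record that one more thread has parked.  Returns true once the
 * parked count has reached the thread count, i.e. all are asleep.
 */
bool
model_park(struct process_model *pr)
{
	bool all_parked;

	pthread_mutex_lock(&pr->mtx);
	pr->singlecnt++;
	all_parked = (pr->singlecnt == pr->threadcnt);
	pthread_mutex_unlock(&pr->mtx);
	return all_parked;
}
```

Both counters are only touched with the mutex held, so the comparison never
sees a half-updated pair. With three threads, model_park() returns false
twice and true on the third call.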


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CMD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
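
Guarding a step with a flag so that it runs only once can be sketched with
C11 atomics. This illustrates the idea, not the kernel code: EX_PS_EXITING,
struct proc_flags and both functions are hypothetical, and the commit does
not say which primitive the kernel uses to set the flag.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define EX_PS_EXITING	0x01u	/* hypothetical flag bit */

/* Hypothetical per-process state for the example. */
struct proc_flags {
	atomic_uint flags;	/* process flag word */
	int normal_exit_runs;	/* how often the guarded step ran */
};

/* Start with no flags set and no runs recorded. */
void
proc_flags_init(struct proc_flags *pr)
{
	atomic_init(&pr->flags, 0);
	pr->normal_exit_runs = 0;
}

/*
 * Set the exiting flag and run the guarded step only if this caller
 * was the one that set it.  Returns true if the step ran.
 */
bool
begin_normal_exit(struct proc_flags *pr)
{
	unsigned int old = atomic_fetch_or(&pr->flags, EX_PS_EXITING);

	if (old & EX_PS_EXITING)
		return false;	/* someone else already started exiting */
	pr->normal_exit_runs++;
	return true;
}
```

The first call to begin_normal_exit() returns true and later calls return
false, so the guarded step runs exactly once regardless of which thread gets
there first.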


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
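
The WNOHANG behaviour can be observed from user space. The sketch below forks
a child that blocks, polls it without waiting, and reports what is left in
the status variable; poll_running_child() and its sentinel parameter are
hypothetical names for this example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Fork a child that blocks, poll it once with WNOHANG, then kill and
 * reap it.  Returns the value left in status after the poll, or -1 if
 * fork() failed or the poll did not return 0.
 */
int
poll_running_child(int sentinel)
{
	int status = sentinel;
	int cleanup;
	pid_t pid, r;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		pause();	/* child: block until a signal arrives */
		_exit(0);
	}

	/* The child has not terminated, so this poll should return 0. */
	r = waitpid(pid, &status, WNOHANG);

	/* Terminate and reap the child so no zombie is left behind. */
	kill(pid, SIGKILL);
	waitpid(pid, &cleanup, 0);

	return (r == 0) ? status : -1;
}
```

On a system that leaves status alone when the WNOHANG poll returns 0, which
is what this commit establishes for wait4, poll_running_child(12345) returns
12345.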


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they have. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


Revision tags: OPENBSD_7_4_BASE
# 1.217 29-Sep-2023 claudio

Extend single_thread_set() mode with additional flag attributes.

The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter
the behaviour of single_thread_set(). This allows explicit control
of the SINGLE_DEEP behaviour.

If SINGLE_DEEP is set the deep flag is passed to the initial check call
and by that the check will error out instead of suspending (SINGLE_UNWIND)
or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to
single_thread_set() outside of userret. E.g. at the start of sys_execve
because the proc is not allowed to call exit1() in that location.

SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefor
returns BEFORE all threads have been parked. Currently this is only used by
the ptrace code and should not be used anywhere else. Not waiting for all
threads to settle is asking for trouble.

This solves an issue by using SINGLE_UNWIND in the coredump case where
the code should actually exit in case another thread crashed moments earlier.
Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since
the call to pledge_fail() is for sure not at the kernel boundary.

OK mpi@


# 1.216 21-Sep-2023 claudio

Move code inside exit1() to better spots.

- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.

With and OK jca@


# 1.215 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear therefor revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CLD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency.

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
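
The run-once guard described above can be sketched as follows. This is a
simplified userland model under stated assumptions, not the kernel's
exit1(): struct toy_exit_process, TOY_PS_EXITING and toy_exit_normal()
are hypothetical names, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PS_EXITING	0x01	/* hypothetical bit, models PS_EXITING */

struct toy_exit_process {
	int flags;
	int xexit;	/* recorded exit status */
};

/*
 * Record the exit status only for the first caller.  Returns 1 if this
 * call did the recording and 0 if the process was already exiting.
 */
int
toy_exit_normal(struct toy_exit_process *pr, int code)
{
	assert(pr != NULL);
	if (pr->flags & TOY_PS_EXITING)
		return 0;
	pr->flags |= TOY_PS_EXITING;
	pr->xexit = code;
	return 1;
}
```

A second call, for example from the last thread being torn down, leaves
the status recorded by the first call untouched.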


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
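
As a rough illustration of consolidating a separately stored exit code
and signal into one wait status, here is a userland sketch. It assumes
the traditional BSD status layout; TOY_W_EXITCODE(), toy_wait_status()
and the decode helpers are hypothetical names, and the core-dump and
stopped encodings are ignored.

```c
#include <assert.h>
#include <sys/wait.h>

/* Assumed layout: exit code in bits 8-15, signal in the low 7 bits. */
#define TOY_W_EXITCODE(code, sig)	(((code) << 8) | (sig))

/* Fold a separately stored exit code and signal into one wait status. */
int
toy_wait_status(int xexit, int xsig)
{
	assert(xexit >= 0 && xexit <= 0xff);
	assert(xsig >= 0 && xsig < 0x7f);
	return TOY_W_EXITCODE(xexit, xsig);
}

/* Decode with the standard macros; -1 means "did not exit normally". */
int
toy_exit_code(int status)
{
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Decode with the standard macros; -1 means "not killed by a signal". */
int
toy_term_signal(int status)
{
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Keeping the two values apart until a caller needs the combined form
avoids having to unpack a status word when only one of them is wanted.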


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
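
A small cache of recently exited pids can be sketched as a fixed-size
ring. This is a hypothetical userland model, not the kernel's pid
allocator: struct toy_pidcache, TOY_RECENT and both functions are
invented for illustration.

```c
#include <assert.h>

#define TOY_RECENT	4	/* hypothetical cache size */

struct toy_pidcache {
	int pids[TOY_RECENT];	/* 0 marks an unused slot */
	int next;		/* slot to overwrite next */
};

/* Remember a pid that just exited, evicting the oldest entry. */
void
toy_pid_exited(struct toy_pidcache *pc, int pid)
{
	assert(pid > 0);
	pc->pids[pc->next] = pid;
	pc->next = (pc->next + 1) % TOY_RECENT;
}

/* A candidate pid is refused while it is still in the cache. */
int
toy_pid_recent(const struct toy_pidcache *pc, int pid)
{
	int i;

	for (i = 0; i < TOY_RECENT; i++)
		if (pc->pids[i] == pid)
			return 1;
	return 0;
}
```

An allocator built on this would retry with another candidate whenever
toy_pid_recent() returns 1, so a pid is not reused right after its
previous owner exited.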


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@
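
As a rough illustration of the ID scheme described in revision 1.82, the sketch below shows how a kill-style lookup could tell a process ID from the thread ID of the original thread. Only the relation tid = pid + THREAD_PID_OFFSET comes from the log; the constant's value, the helper names, and the assumption that ordinary pids stay below the offset are made up for this example.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Assumed value for illustration; the log does not state the real constant. */
#define SKETCH_THREAD_PID_OFFSET 1000000

/* Thread ID of a process's original thread, per the rule in the log. */
static pid_t
sketch_main_tid(pid_t pid)
{
	return pid + SKETCH_THREAD_PID_OFFSET;
}

/*
 * True if the ID lies above the offset and so names a single thread.
 * This assumes ordinary pids never reach the offset.
 */
static bool
sketch_is_tid(pid_t id)
{
	return id > SKETCH_THREAD_PID_OFFSET;
}

/* Recover the owning pid from the original thread's tid. */
static pid_t
sketch_tid_to_pid(pid_t tid)
{
	return tid - SKETCH_THREAD_PID_OFFSET;
}
```

With these helpers, a signal sent to sketch_main_tid(pid) can be routed to the original thread only, while a signal sent to pid itself is treated as process-wide.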


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
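
The idea behind revision 1.63 can be sketched in userland with C11 atomics. This is not the kernel's atomic_setbits_int()/atomic_clearbits_int() implementation; the helper names and flag values below are hypothetical, and the sketch only shows why a read-modify-write on a shared flag word is done as one atomic operation.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, standing in for p_flag values. */
#define SKETCH_FLAG_A 0x01u
#define SKETCH_FLAG_B 0x02u

/* Atomically set the given bits, leaving all other bits untouched. */
static void
sketch_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear the given bits, leaving all other bits untouched. */
static void
sketch_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain `flags |= bits` compiles to a separate load, OR, and store, so an interrupt that updates the word in between could have its change overwritten; the atomic forms avoid that window.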


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
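
From userland, the behaviour added in revision 1.47 can be observed with a small test like the one below. It is a sketch using only standard POSIX calls; the function name is hypothetical, and it assumes the system reports continued children to waitpid() when WCONTINUED is requested.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, stop it, continue it, and check that waitpid() reports
 * both transitions.  Returns 1 if both were seen, 0 otherwise.
 */
static int
sketch_observe_continue(void)
{
	int status, ok = 0;
	pid_t pid = fork();

	if (pid == -1)
		return 0;
	if (pid == 0) {
		for (;;)
			pause();	/* child idles until it is killed */
	}

	kill(pid, SIGSTOP);
	if (waitpid(pid, &status, WUNTRACED) == pid && WIFSTOPPED(status)) {
		kill(pid, SIGCONT);
		/* A continued status must not also look like a stop. */
		if (waitpid(pid, &status, WCONTINUED) == pid &&
		    WIFCONTINUED(status) && !WIFSTOPPED(status))
			ok = 1;
	}

	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);	/* reap the child */
	return ok;
}
```

The `!WIFSTOPPED(status)` check corresponds to the point made in the commit message: a continued status is kept distinguishable from a stopped one.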


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
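
The clipping rule from the "nice bug" paragraph can be sketched as follows. The bound NICE_WEIGHT * PRIO_MAX - PPQ is taken from the log; the constant values (2, 20, 4), the unscaled sum used for the priority, and the division by PPQ to pick a queue are assumptions for illustration, and the user-priority base offset is left out.

```c
/* Values assumed for illustration; the log gives only the nice weight of two. */
#define SKETCH_NICE_WEIGHT	2
#define SKETCH_PRIO_MAX		20
#define SKETCH_PPQ		4	/* priorities per run queue */

/* Upper bound the log says p_estcpu has to stay within. */
#define SKETCH_ESTCPU_MAX \
	(SKETCH_NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ)

/* Clip the CPU-usage estimate to the bound. */
static unsigned int
sketch_clip_estcpu(unsigned int estcpu)
{
	return estcpu > SKETCH_ESTCPU_MAX ? SKETCH_ESTCPU_MAX : estcpu;
}

/* Assumed priority: clipped estimate plus weighted nice, no base offset. */
static unsigned int
sketch_pri(unsigned int estcpu, unsigned int nice)
{
	return sketch_clip_estcpu(estcpu) + SKETCH_NICE_WEIGHT * nice;
}

/* Assumed mapping from priority to run queue index. */
static unsigned int
sketch_queue(unsigned int estcpu, unsigned int nice)
{
	return sketch_pri(estcpu, nice) / SKETCH_PPQ;
}
```

Under these assumptions the bound is 36, so a fully CPU-bound nice +0 process tops out in queue 9, one queue ahead of an idle nice +20 process in queue 10. That is the property the commit message asks for.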


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
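
Seen from userland, the effect of SA_NOCLDWAIT can be checked with a sketch like the one below. It uses only standard POSIX calls; the function name is hypothetical. Whether the kernel gets there by reparenting to PID 1, as in this revision, or by other means is not visible to the caller.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * With SA_NOCLDWAIT set for SIGCHLD, an exited child is not left as a
 * zombie for us, so waiting for it is expected to fail with ECHILD.
 * Returns 1 if that is what happens, 0 otherwise.
 */
static int
sketch_nocldwait(void)
{
	struct sigaction sa, old;
	pid_t pid;
	int ok;

	sa.sa_handler = SIG_DFL;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_NOCLDWAIT;
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return 0;

	pid = fork();
	if (pid == 0)
		_exit(0);		/* child exits immediately */

	/* Blocks until the child is gone, then reports no such child. */
	ok = (pid > 0 && waitpid(pid, NULL, 0) == -1 && errno == ECHILD);

	sigaction(SIGCHLD, &old, NULL);	/* restore previous disposition */
	return ok;
}
```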


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


Revision tags: OPENBSD_7_4_BASE
# 1.217 29-Sep-2023 claudio

Extend single_thread_set() mode with additional flag attributes.

The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter
the behaviour of single_thread_set(). This allows explicit control
of the SINGLE_DEEP behaviour.

If SINGLE_DEEP is set the deep flag is passed to the initial check call
and by that the check will error out instead of suspending (SINGLE_UNWIND)
or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to
single_thread_set() outside of userret. E.g. at the start of sys_execve
because the proc is not allowed to call exit1() in that location.

SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore
returns BEFORE all threads have been parked. Currently this is only used by
the ptrace code and should not be used anywhere else. Not waiting for all
threads to settle is asking for trouble.

This solves an issue by using SINGLE_UNWIND in the coredump case where
the code should actually exit in case another thread crashed moments earlier.
Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since
the call to pledge_fail() is for sure not at the kernel boundary.

OK mpi@
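
The "mode or-ed with flags" convention from revision 1.217 might be decoded roughly as shown below. The numeric values and helper names are assumptions for illustration, not copied from the kernel headers; only the names SINGLE_UNWIND, SINGLE_EXIT, SINGLE_DEEP and SINGLE_NOWAIT and their described meaning come from the log.

```c
#include <stdbool.h>

/* Assumed encoding: low bits carry the mode, high bits carry attributes. */
#define SKETCH_SINGLE_UNWIND	0x02
#define SKETCH_SINGLE_EXIT	0x03
#define SKETCH_SINGLE_MASK	0x0f
#define SKETCH_SINGLE_DEEP	0x10	/* error out instead of unwinding/exiting */
#define SKETCH_SINGLE_NOWAIT	0x20	/* return before all threads are parked */

/* Extract the base mode from the combined argument. */
static int
sketch_single_mode(int flags)
{
	return flags & SKETCH_SINGLE_MASK;
}

/* True if the caller asked for the deep check. */
static bool
sketch_single_is_deep(int flags)
{
	return (flags & SKETCH_SINGLE_DEEP) != 0;
}

/* True if the call should wait for sibling threads to park. */
static bool
sketch_single_waits(int flags)
{
	return (flags & SKETCH_SINGLE_NOWAIT) == 0;
}
```

A caller outside userret would then pass something like `SKETCH_SINGLE_UNWIND | SKETCH_SINGLE_DEEP`, keeping the mode and its attributes in a single argument.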


# 1.216 21-Sep-2023 claudio

Move code inside exit1() to better spots.

- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.

With and OK jca@


# 1.215 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@
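
The count-up scheme from revision 1.213 can be pictured with the toy model below. The struct and function names are hypothetical, locking is omitted (the real counters are protected by the process mutex), and the log does not say whether the calling thread is included in the count, so the sketch simply compares the two values.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two counters named in the log. */
struct sketch_process {
	unsigned int threadcnt;	/* threads that have to be accounted for */
	unsigned int singlecnt;	/* threads that have parked so far */
};

/* Record that one more thread has parked. */
static void
sketch_thread_parked(struct sketch_process *pr)
{
	pr->singlecnt++;
}

/* All threads are asleep once the parked count reaches the thread count. */
static bool
sketch_all_parked(const struct sketch_process *pr)
{
	return pr->singlecnt == pr->threadcnt;
}
```

Counting up to a second counter, rather than down to zero, lets the target change while threads are being parked, for instance when a thread exits in the meantime.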


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CMD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
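
Illustration only, not part of the commit: a minimal userland sketch of how
a separately stored exit code and signal might be folded into one wait
status when a caller needs it. The struct and function names are
hypothetical. The encoding assumed is the conventional one (exit code in
bits 8-15, signal number in the low 7 bits); the core-dump flag is ignored
here.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical stand-in for the two per-process fields named above. */
struct exit_info {
	int xexit;	/* exit code passed to _exit(2) */
	int xsig;	/* terminating signal, 0 if none */
};

/*
 * Assumed encoding: a nonzero signal goes in the low 7 bits,
 * otherwise the exit code goes in bits 8-15.
 */
int
consolidated_status(const struct exit_info *ei)
{
	if (ei->xsig != 0)
		return (ei->xsig & 0x7f);
	return ((ei->xexit & 0xff) << 8);
}
```

The result can be decoded with the usual WIFEXITED(), WEXITSTATUS(),
WIFSIGNALED() and WTERMSIG() macros from <sys/wait.h>.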


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
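
Illustration only, not part of the commit: a minimal userland sketch of
the WNOHANG behaviour described above, as seen by a caller. It uses
waitpid(2) rather than wait4(2) for portability, so the rusage half of the
fix is not exercised. The function name and the sentinel value are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a5a5a

/*
 * Returns 1 if a WNOHANG poll of a still-running child returned 0 and
 * left the caller's status word untouched, 0 if not, -1 if fork failed.
 */
int
wnohang_leaves_status_alone(void)
{
	int status = STATUS_SENTINEL;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		/* Child: block until the parent kills it. */
		for (;;)
			pause();
	}

	/* The child is still running, so nothing should be reaped. */
	ret = waitpid(pid, &status, WNOHANG);

	/* Clean up: terminate and reap the child. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	return (ret == 0 && status == STATUS_SENTINEL);
}
```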


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
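
Illustration only, not part of the commit: a minimal sketch of the
"unpublish first, then release" ordering described above, using plain
userland structs. All names are hypothetical; node_release() merely stands
in for a release routine such as vrele() that may sleep when the last
reference goes away.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object. */
struct node {
	int refs;
};

/* Hypothetical holder with a pointer that other code can reach. */
struct owner {
	struct node *np;
};

int released;	/* counts objects whose last reference was dropped */

/* Stand-in for a release routine that may sleep on the last reference. */
void
node_release(struct node *n)
{
	if (--n->refs == 0)
		released++;
}

/*
 * Clear the reachable pointer before releasing, so nobody can find the
 * object through the owner while the release is in progress.
 */
void
owner_drop_node(struct owner *o)
{
	struct node *n = o->np;

	o->np = NULL;
	if (n != NULL)
		node_release(n);
}
```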


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
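
A small userland sketch of how a caller might observe the behaviour described above, using only the standard waitpid(2) interface. The helper name demo_wcontinued() is hypothetical. It stops and resumes a child and checks that the continued status is reported by WIFCONTINUED and is not mistaken for a stopped or signaled status.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if stop, continue and kill events
 * of a child were all reported as expected, 0 otherwise.
 */
int
demo_wcontinued(void)
{
	pid_t pid;
	int status, ok = 1;

	pid = fork();
	if (pid == -1)
		return 0;
	if (pid == 0) {
		for (;;)
			pause();	/* child idles until it is killed */
	}

	/* Stop the child and wait for the stop to be reported. */
	kill(pid, SIGSTOP);
	if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
		ok = 0;

	/* Resume it and ask explicitly for the "continued" event. */
	kill(pid, SIGCONT);
	if (waitpid(pid, &status, WCONTINUED) != pid || !WIFCONTINUED(status))
		ok = 0;
	else if (WIFSTOPPED(status) || WIFSIGNALED(status))
		ok = 0;		/* must not look stopped or signaled */

	/* Clean up: kill and reap the child. */
	kill(pid, SIGKILL);
	if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
		ok = 0;
	return ok;
}
```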


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
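
The clipping rule from the "nice bug" paragraph can be sketched as follows. This is an illustration under stated assumptions, not the committed code: the constant values (NICE_WEIGHT 2, a maximum nice of 20, 4 priorities per queue) are modeled on traditional BSD values, and the priority offset formula estcpu + NICE_WEIGHT * nice is assumed. All names ending in _SKETCH and the three helper functions are hypothetical.

```c
#include <assert.h>

/* Assumed illustrative values, modeled on traditional BSD constants. */
#define NICE_WEIGHT_SKETCH	2	/* priority levels per nice step */
#define PRIO_MAX_SKETCH		20	/* largest nice value */
#define PPQ_SKETCH		4	/* priorities per run queue */

/* Upper bound for the decayed CPU estimate: 2 * 20 - 4 = 36. */
#define ESTCPU_LIMIT_SKETCH \
	(NICE_WEIGHT_SKETCH * PRIO_MAX_SKETCH - PPQ_SKETCH)

/* Clip the CPU usage estimate so it cannot outweigh nice +20. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT_SKETCH ? estcpu : ESTCPU_LIMIT_SKETCH;
}

/* Assumed priority offset: usage estimate plus weighted nice value. */
unsigned int
pri_offset(unsigned int estcpu, unsigned int nice)
{
	return clip_estcpu(estcpu) + NICE_WEIGHT_SKETCH * nice;
}

/* Run queue index for a given priority offset. */
unsigned int
queue_index(unsigned int offset)
{
	return offset / PPQ_SKETCH;
}
```

With these assumed numbers, a compute-bound nice +0 process tops out at offset 36 (queue 9), while an idle nice +20 process starts at offset 40 (queue 10), so the two never share a queue. Clipping at 40 instead of 36 would put them on the same queue, which is why the limit subtracts PPQ.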


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
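
A short userland sketch of the effect visible to a program that sets SA_NOCLDWAIT, using only standard sigaction(2) and waitpid(2). It does not show the reparenting to PID 1, which is the kernel-side mechanism described above. The helper name demo_nocldwait() is hypothetical. With the flag set, the exited child leaves no zombie behind, so waiting for it fails with ECHILD.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a child that exited under
 * SA_NOCLDWAIT could not be waited for (ECHILD), 0 otherwise.
 */
int
demo_nocldwait(void)
{
	struct sigaction sa, old;
	pid_t pid;
	int rv = 0, saved_errno = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_NOCLDWAIT;	/* children must not become zombies */
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return 0;

	pid = fork();
	if (pid == 0)
		_exit(0);
	if (pid > 0) {
		/* Blocks until the child is gone, then has nothing to reap. */
		rv = waitpid(pid, NULL, 0);
		saved_errno = errno;
	}

	/* Restore the previous SIGCHLD disposition. */
	sigaction(SIGCHLD, &old, NULL);
	return pid > 0 && rv == -1 && saved_errno == ECHILD;
}
```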


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.219 16-Jan-2024 deraadt

The kernel will now read pinsyscall tables out of PT_OPENBSD_SYSCALLS in
the main program or ld.so, and accept a submission of that information
for libc.so from ld.so via pinsyscalls(2). At system call invocation,
the syscall number is matched to the specific address it must come from.
ok kettenis, gnezdo, testing of variations by many people


# 1.218 15-Jan-2024 mvs

Introduce priterator(), the `ps_list' iterator. Some of `allprocess'
list walkthroughs have context switch within, so make exit1() wait
until the last reference released.

Reported-by: syzbot+0e9dda76c42c82c626d7@syzkaller.appspotmail.com

ok bluhm claudio


Revision tags: OPENBSD_7_4_BASE
# 1.217 29-Sep-2023 claudio

Extend single_thread_set() mode with additional flag attributes.

The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter
the behaviour of single_thread_set(). This allows explicit control
of the SINGLE_DEEP behaviour.

If SINGLE_DEEP is set the deep flag is passed to the initial check call
and by that the check will error out instead of suspending (SINGLE_UNWIND)
or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to
single_thread_set() outside of userret. E.g. at the start of sys_execve
because the proc is not allowed to call exit1() in that location.

SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore
returns BEFORE all threads have been parked. Currently this is only used by
the ptrace code and should not be used anywhere else. Not waiting for all
threads to settle is asking for trouble.

This solves an issue by using SINGLE_UNWIND in the coredump case where
the code should actually exit in case another thread crashed moments earlier.
Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since
the call to pledge_fail() is for sure not at the kernel boundary.

OK mpi@


# 1.216 21-Sep-2023 claudio

Move code inside exit1() to better spots.

- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.

With and OK jca@


# 1.215 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CMD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@
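
A minimal userland sketch of calling waitid(2) as specified by POSIX. The helper name demo_waitid() is hypothetical. It forks a child that exits with a given code and reads that code back from the siginfo_t filled in by waitid(2).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the exit code of a child as reported
 * through siginfo_t, or -1 if anything unexpected happens.
 */
int
demo_waitid(int code)
{
	siginfo_t si;
	pid_t pid;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(code);

	memset(&si, 0, sizeof(si));
	/* Wait for exactly this child to exit. */
	if (waitid(P_PID, (id_t)pid, &si, WEXITED) == -1)
		return -1;
	/* A normal exit is reported as CLD_EXITED with the code in si_status. */
	if (si.si_code != CLD_EXITED || si.si_pid != pid)
		return -1;
	return si.si_status;
}
```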


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
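
A rough sketch of what "a consolidated status" could look like, assuming the traditional wait status encoding (exit code in bits 8-15, terminating signal in the low 7 bits). The helper names make_wait_status(), status_exit_code() and status_signal() are hypothetical and are not the kernel's; the standard <sys/wait.h> macros are used only to decode the result.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: combine a separately stored exit code and
 * signal number into one wait(2)-style status word, assuming the
 * traditional encoding.
 */
int
make_wait_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}

/* Exit code if the status describes a normal exit, else -1. */
int
status_exit_code(int status)
{
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Terminating signal if the status describes a signal death, else -1. */
int
status_signal(int status)
{
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Keeping the two values separate until a consumer asks for them means only the final conversion has to know the encoding.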


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it is awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
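
A minimal userland sketch of the behaviour described above, assuming a
POSIX system that provides wait4(2). The function name and the sentinel
value are hypothetical and not taken from the commit; the check only
observes the result from outside the kernel.

```c
#define _DEFAULT_SOURCE		/* exposes wait4() on glibc; harmless elsewhere */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if wait4(WNOHANG) on a still-running
 * child returned 0 and left `status' untouched, 0 if it did not, and
 * -1 if fork() failed.
 */
int
wnohang_leaves_status_alone(void)
{
	const int sentinel = 0x5a5a;	/* arbitrary marker value */
	int status = sentinel;
	struct rusage ru;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: stay alive until killed */
	}

	/* The child has not terminated, so WNOHANG should yield 0. */
	ret = wait4(pid, &status, WNOHANG, &ru);

	/* Clean up: kill the child and reap it with a blocking wait. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	return (ret == 0 && status == sentinel) ? 1 : 0;
}
```

A caller would treat a return of 1 as "status was left alone"; the
rusage part of the fix is not checked here.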


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
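
The commit message does not show how the cache is built. One plausible
shape, sketched here under that assumption, is a small ring buffer of
recently freed pids that the allocator consults before handing out a
candidate. All names and the cache size are hypothetical.

```c
#define SK_OLDPIDS 4	/* hypothetical cache size */

static int sk_oldpids[SK_OLDPIDS];	/* most recently exited pids */
static unsigned int sk_oldpid_next;	/* next slot to overwrite */

/* Remember a pid that has just been released by an exiting process. */
void
sk_freepid(int pid)
{
	sk_oldpids[sk_oldpid_next++ % SK_OLDPIDS] = pid;
}

/* Return 1 if `pid' exited recently and should not be recycled yet. */
int
sk_pid_is_recent(int pid)
{
	int i;

	for (i = 0; i < SK_OLDPIDS; i++) {
		if (sk_oldpids[i] == pid)
			return 1;
	}
	return 0;
}
```

An allocator built on this sketch would keep picking candidates until
sk_pid_is_recent() returns 0 and the pid is not otherwise in use.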


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@
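
The idea of a per-process reference count can be sketched as follows.
This is an illustration only: the struct, field, and function names are
hypothetical, and the sketch assumes the caller serialises access (no
locking or atomics are shown).

```c
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process state. */
struct sk_process {
	unsigned int sk_refcnt;	/* one reference per attached thread */
};

/* Allocate the shared state with a single reference (the main thread). */
struct sk_process *
sk_process_new(void)
{
	struct sk_process *pr = calloc(1, sizeof(*pr));

	if (pr != NULL)
		pr->sk_refcnt = 1;
	return pr;
}

/* A new thread in the process takes its own reference. */
void
sk_process_hold(struct sk_process *pr)
{
	pr->sk_refcnt++;
}

/*
 * Drop one reference. Only the last thread to get here frees the
 * structure, no matter in which order the threads reach the reaper.
 * Returns 1 if the structure was freed, 0 otherwise.
 */
int
sk_process_rele(struct sk_process *pr)
{
	if (--pr->sk_refcnt > 0)
		return 0;
	free(pr);
	return 1;
}
```

With this shape, the main thread being reaped first only drops the count;
the memory stays valid until the remaining threads release theirs.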


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
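
A userland sketch of the same pattern, assuming C11 <stdatomic.h> in
place of the kernel's atomic.h. The helper names and flag values are
hypothetical analogues of atomic_{set,clear}bits_int, not the kernel
functions themselves.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define SK_FLAG_A	0x0001u
#define SK_FLAG_B	0x0002u

/* Set bits with one atomic read-modify-write instead of `flags |= bits'. */
void
sk_setbits_int(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Clear bits with one atomic read-modify-write instead of `flags &= ~bits'. */
void
sk_clearbits_int(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}

/* Read the current flag word. */
unsigned int
sk_readflags(atomic_uint *p)
{
	return atomic_load(p);
}
```

The point of the pattern is that a plain `|=' or `&= ~' is a separate
load and store, so an interrupt updating the same word in between can
have its change lost.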


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
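
The bit trick can be sketched with stand-in definitions. The names below
are hypothetical and the numeric values are assumptions chosen to satisfy
the relations stated in the message; they are not copied from
<sys/wait.h>.

```c
/* Assumed values: 7 low bits set for "stopped", 16 bits for "continued". */
#define SK_WSTOPPED	0177
#define SK_WCONTINUED	0177777

/* Low 7 bits of the status word, in the spirit of _WSTATUS(). */
int
sk_wstatus(int x)
{
	return x & 0177;
}

/* Compare the low 8 bits, so SK_WCONTINUED (0377 there) does not match. */
int
sk_wifstopped(int x)
{
	return (x & 0xff) == SK_WSTOPPED;
}

/* Continued is reported only when the whole 16-bit marker is present. */
int
sk_wifcontinued(int x)
{
	return (x & SK_WCONTINUED) == SK_WCONTINUED;
}

/* Unchanged test: anything that is neither "stopped" nor a clean exit. */
int
sk_wifsignaled(int x)
{
	return sk_wstatus(x) != SK_WSTOPPED && sk_wstatus(x) != 0;
}
```

Because sk_wstatus(SK_WCONTINUED) equals SK_WSTOPPED, the signaled test
already rejects a continued status without an extra comparison, while the
8-bit stopped test keeps the two states apart.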


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
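
As a rough illustration of the nice-bug fix described above, the C sketch below clips the CPU estimate to NICE_WEIGHT * PRIO_MAX - PPQ before a user priority is computed. The constant values, the base priority and the helper names (estcpu_limit, user_priority) are assumptions chosen for illustration; they are not taken from this revision.

```c
#include <assert.h>

/* Illustrative constants; the values are assumptions, not from the commit. */
#define NICE_WEIGHT 2   /* priority steps per nice level */
#define PRIO_MAX    20  /* largest nice value */
#define PPQ         4   /* priorities per run queue */
#define PUSER       50  /* base user priority */
#define MAXPRI      127 /* numerically largest (weakest) priority */

/*
 * Clip the CPU-usage estimate so that a compute-bound nice 0 process
 * cannot penalize itself onto the queue used by nice +20 processes.
 */
static unsigned int
estcpu_limit(unsigned int estcpu)
{
	unsigned int lim = NICE_WEIGHT * PRIO_MAX - PPQ;

	return estcpu < lim ? estcpu : lim;
}

/* Hypothetical priority: base + clipped estimate + weighted nice value. */
static unsigned int
user_priority(unsigned int estcpu, int nice)
{
	unsigned int pri;

	assert(nice >= 0 && nice <= PRIO_MAX);
	pri = PUSER + estcpu_limit(estcpu) + NICE_WEIGHT * (unsigned int)nice;
	return pri > MAXPRI ? MAXPRI : pri;
}
```

With these assumed values, a saturated nice 0 process always lands on a lower-numbered (better) queue than an idle nice +20 process, which is the property the clipping is meant to guarantee.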


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.217 29-Sep-2023 claudio

Extend single_thread_set() mode with additional flag attributes.

The mode can now be or-ed with SINGLE_DEEP or SINGLE_NOWAIT to alter
the behaviour of single_thread_set(). This allows explicit control
of the SINGLE_DEEP behaviour.

If SINGLE_DEEP is set the deep flag is passed to the initial check call
and by that the check will error out instead of suspending (SINGLE_UNWIND)
or exiting (SINGLE_EXIT). The SINGLE_DEEP flag is required in calls to
single_thread_set() outside of userret. E.g. at the start of sys_execve
because the proc is not allowed to call exit1() in that location.

SINGLE_NOWAIT skips the wait at the end of single_thread_set() and therefore
returns BEFORE all threads have been parked. Currently this is only used by
the ptrace code and should not be used anywhere else. Not waiting for all
threads to settle is asking for trouble.

This solves an issue by using SINGLE_UNWIND in the coredump case where
the code should actually exit in case another thread crashed moments earlier.
Also the SINGLE_UNWIND in pledge_fail() is now marked SINGLE_DEEP since
the call to pledge_fail() is for sure not at the kernel boundary.

OK mpi@
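
The idea of or-ing flag attributes into the mode can be sketched as follows. This is a minimal model only: the MODEL_* names, their numeric values and the decode_single_flags() helper are hypothetical and do not reproduce the kernel's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encodings; the real kernel values may differ. */
#define MODEL_SINGLE_UNWIND 0x01 /* siblings unwind to the kernel boundary */
#define MODEL_SINGLE_EXIT   0x02 /* siblings exit */
#define MODEL_SINGLE_MASK   0x0f /* low bits carry the mode */
#define MODEL_SINGLE_DEEP   0x10 /* caller is not at the kernel boundary */
#define MODEL_SINGLE_NOWAIT 0x20 /* do not wait for siblings to park */

struct single_request {
	int mode;  /* base mode with the flag bits stripped */
	bool deep; /* initial check must error out instead of exiting */
	bool wait; /* wait until all siblings are parked */
};

/* Split a combined "mode | flags" argument into its parts. */
static struct single_request
decode_single_flags(int flags)
{
	struct single_request req;

	req.mode = flags & MODEL_SINGLE_MASK;
	req.deep = (flags & MODEL_SINGLE_DEEP) != 0;
	req.wait = (flags & MODEL_SINGLE_NOWAIT) == 0;
	assert(req.mode != 0);
	return req;
}
```

Keeping the mode in the low bits lets existing callers pass a bare mode unchanged, while new callers opt in to the extra behaviour explicitly.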


# 1.216 21-Sep-2023 claudio

Move code inside exit1() to better spots.

- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.

With and OK jca@


# 1.215 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@
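
The count-up scheme can be modelled in a few lines of C. This is a user-space sketch under stated assumptions: struct process_model and the model_* functions are hypothetical, locking is left to the caller, and threadcnt here simply means the number of threads that have to park.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two counters named in the commit. */
struct process_model {
	int threadcnt; /* threads that must park (models ps_threadcnt) */
	int singlecnt; /* threads parked so far (models ps_singlecnt) */
};

static void
model_init(struct process_model *pr, int nthreads)
{
	pr->threadcnt = nthreads;
	pr->singlecnt = 0;
}

/*
 * Called when a thread parks. The caller is assumed to hold the
 * process mutex. Returns true once the count has reached threadcnt,
 * i.e. every thread is asleep and the waiter can be woken.
 */
static bool
model_park(struct process_model *pr)
{
	pr->singlecnt++;
	assert(pr->singlecnt <= pr->threadcnt);
	return pr->singlecnt == pr->threadcnt;
}
```

Counting up toward ps_threadcnt, rather than down to zero, means the target can be re-read under the same mutex if the thread count changes while waiting.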


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CMD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
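
The guard described above amounts to a test-and-set on a process flag. The sketch below is a simplified model: MODEL_PS_EXITING, struct exiting_model and exit_normal_once() are hypothetical names, and the caller is assumed to provide whatever locking is needed.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PS_EXITING 0x1 /* hypothetical stand-in for PS_EXITING */

struct exiting_model {
	int flags;       /* process flags */
	int exit_code;   /* status recorded by the first exit */
	int normal_runs; /* how often the process-wide exit path ran */
};

/*
 * Run the process-wide exit path at most once. A later caller (for
 * example the last thread being torn down) sees the flag and leaves
 * the recorded exit status untouched.
 */
static bool
exit_normal_once(struct exiting_model *pr, int code)
{
	if (pr->flags & MODEL_PS_EXITING)
		return false;
	pr->flags |= MODEL_PS_EXITING;
	pr->exit_code = code;
	pr->normal_runs++;
	assert(pr->normal_runs == 1);
	return true;
}
```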


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
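
Converting the two separate values back into one wait status might look like the sketch below. It assumes the traditional status layout (exit code in bits 8 to 15, signal number in the low seven bits); struct exit_info and consolidated_status() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <sys/wait.h>

/* Hypothetical holder for the two fields named in the commit. */
struct exit_info {
	int xexit; /* exit code (models ps_xexit) */
	int xsig;  /* terminating signal, 0 if none (models ps_xsig) */
};

/*
 * Build a consolidated status in the traditional layout:
 * exit code in bits 8-15, signal number in the low 7 bits.
 */
static int
consolidated_status(const struct exit_info *ei)
{
	assert(ei->xexit >= 0 && ei->xexit <= 0xff);
	assert(ei->xsig >= 0 && ei->xsig <= 0x7f);
	return (ei->xexit << 8) | ei->xsig;
}
```

The result can be inspected with the standard WIFEXITED, WEXITSTATUS, WIFSIGNALED and WTERMSIG macros from <sys/wait.h> on systems that use this layout.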


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
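
The "small cache of recently exited pids" can be pictured as a fixed-size ring that the pid allocator consults before handing out a number. The sketch below is only an illustration of that idea: the names (toy_pid_exited, toy_pid_is_recent), the ring size, and the single-threaded, lock-free layout are all assumptions and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NRECENT 4	/* hypothetical cache size, chosen for the example */

/* Ring of the most recently exited pids; 0 marks an unused slot. */
static int toy_recent[TOY_NRECENT];
static int toy_recent_next;

/* Record an exited pid, overwriting the oldest entry. */
static void
toy_pid_exited(int pid)
{
	assert(pid > 0);
	toy_recent[toy_recent_next] = pid;
	toy_recent_next = (toy_recent_next + 1) % TOY_NRECENT;
}

/* An allocator would skip any candidate for which this returns true. */
static bool
toy_pid_is_recent(int pid)
{
	for (int i = 0; i < TOY_NRECENT; i++) {
		if (toy_recent[i] == pid)
			return true;
	}
	return false;
}
```

Once more than TOY_NRECENT processes have exited, the oldest pid drops out of the ring and becomes eligible for reuse again.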


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
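
The ordering described above (clear every reachable pointer first, then release) can be sketched with toy types. Nothing here is the kernel's struct vnode or vrele(); toy_vnode, toy_owner, and toy_vrele are hypothetical stand-ins, and the "may sleep" behaviour is only noted in a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a reference-counted vnode and its holder. */
struct toy_vnode {
	int refcnt;
};

struct toy_owner {
	struct toy_vnode *vp;
};

/*
 * Toy release: drop one reference and free at zero.  The real vrele()
 * may sleep at this point, which is why callers must not leave a
 * reachable pointer behind while it runs.
 */
static void
toy_vrele(struct toy_vnode *vp)
{
	assert(vp->refcnt > 0);
	if (--vp->refcnt == 0)
		free(vp);
}

/* Detach the pointer first, then release the reference. */
static void
toy_owner_release(struct toy_owner *o)
{
	struct toy_vnode *vp = o->vp;

	o->vp = NULL;		/* nobody can find vp through o any more */
	if (vp != NULL)
		toy_vrele(vp);
}
```

Because the owner's pointer is already NULL when the release runs, a second caller that gets scheduled in the middle sees "no vnode" rather than a pointer to memory that is about to be freed.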


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.
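
The GC-list scheme above can be modelled in a few lines. This is a single-threaded toy, not librthread code: toy_thread, toy_thread_retire, and toy_gc_scan are hypothetical names, the "kernel zeroes the id" step is simulated by the caller writing 0 to tid, and all locking is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread handle kept by a userland thread library. */
struct toy_thread {
	volatile int tid;		/* zeroed once the stack is no longer in use */
	void *stack;
	struct toy_thread *gc_next;
};

static struct toy_thread *toy_gc_list;

/* A dying thread queues its own handle for later collection. */
static void
toy_thread_retire(struct toy_thread *t)
{
	t->gc_next = toy_gc_list;
	toy_gc_list = t;
}

/*
 * Run from the create/join paths: free every queued handle whose id
 * has been zeroed, leave the rest queued.  Returns the number freed.
 */
static int
toy_gc_scan(void)
{
	struct toy_thread **tp = &toy_gc_list;
	int freed = 0;

	while (*tp != NULL) {
		struct toy_thread *t = *tp;

		if (t->tid == 0) {
			*tp = t->gc_next;	/* unlink before freeing */
			free(t->stack);
			free(t);
			freed++;
		} else {
			tp = &t->gc_next;
		}
	}
	return freed;
}
```

A handle whose tid is still nonzero stays on the list, so a stack is never freed while the exiting thread could still be running on it.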

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@
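
The pid/tid convention in this commit can be sketched as plain arithmetic. The helpers and the offset value below are illustrative assumptions (the log does not give the value of THREAD_PID_OFFSET), and the sketch assumes every pid is smaller than the offset.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder value for the example; the real constant may differ. */
#define TOY_THREAD_PID_OFFSET	1000000

/* tid of a process's original thread, per the rule "pid + offset". */
static int
toy_main_tid(int pid)
{
	assert(pid > 0 && pid < TOY_THREAD_PID_OFFSET);
	return pid + TOY_THREAD_PID_OFFSET;
}

/* True when an id passed to a kill-like call names a thread. */
static bool
toy_id_is_tid(int id)
{
	return id > TOY_THREAD_PID_OFFSET;
}

/* Recover the pid from the original thread's tid. */
static int
toy_tid_to_pid(int tid)
{
	assert(toy_id_is_tid(tid));
	return tid - TOY_THREAD_PID_OFFSET;
}
```

With disjoint ranges, a single integer argument is enough for the callee to decide whether the signal targets the whole process or one specific thread.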


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
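
The bit trick described above can be checked with a toy set of macros. The numeric values and macro bodies below are assumptions picked so that the two properties stated in the commit message hold; they are prefixed TOY_ to make clear they are not copied from <sys/wait.h>.

```c
#include <assert.h>

/* Illustrative encodings; treat the exact values as assumptions. */
#define TOY_WSTOPPED	0177		/* low 7 bits all set: "stopped" */
#define TOY_WCONTINUED	0177777		/* low 16 bits all set: "continued" */

#define TOY_WSTATUS(x)		((x) & 0177)
#define TOY_WIFSTOPPED(x)	(((x) & 0xff) == TOY_WSTOPPED)
#define TOY_WIFCONTINUED(x)	(((x) & TOY_WCONTINUED) == TOY_WCONTINUED)
#define TOY_WIFSIGNALED(x) \
	(TOY_WSTATUS(x) != TOY_WSTOPPED && TOY_WSTATUS(x) != 0)

/* The two properties the commit message relies on. */
static void
toy_check_invariants(void)
{
	/* Looks "stopped" to _WSTATUS, so WIFSIGNALED needs no new test. */
	assert(TOY_WSTATUS(TOY_WCONTINUED) == TOY_WSTOPPED);
	/* The 8-bit compare in WIFSTOPPED sees 0xff, not 0177. */
	assert(!TOY_WIFSTOPPED(TOY_WCONTINUED));
}
```

Under these assumed values a genuine stop status such as (17 << 8) | 0177 still satisfies TOY_WIFSTOPPED, because bit 7 of its low byte is clear, while the all-ones continued value does not.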


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.216 21-Sep-2023 claudio

Move code inside exit1() to better spots.

- PS_PROFIL bit is moved into the process cleanup block where it belongs
- The proc read-only limit cache cleanup is moved up right after clearing
p->p_fd cache. lim_free() can potentially sleep and so needs to be
above the line where p_stat is set to SDEAD.

With and OK jca@


# 1.215 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@
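
The count-up scheme described in this entry can be modelled in a few lines
of user-space C. This is only a sketch under stated assumptions: `struct
single_model` and `park_thread()` are hypothetical names, the field names are
borrowed from the log message, locking is left out (the log says the real
counters are protected by the process mutex), and how the requesting thread
itself is accounted for is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; field names borrowed from the log. */
struct single_model {
	int ps_threadcnt;	/* threads that belong to the process */
	int ps_singlecnt;	/* threads that have parked so far */
};

/*
 * Record one more parked thread and report whether every thread is
 * now parked. In the kernel this would run with the process mutex
 * held; the model has no locking at all.
 */
bool
park_thread(struct single_model *pr)
{
	assert(pr->ps_singlecnt < pr->ps_threadcnt);
	pr->ps_singlecnt++;
	return pr->ps_singlecnt == pr->ps_threadcnt;
}
```

Counting up to ps_threadcnt means the "everyone is asleep" condition is a
plain equality test between two fields that are updated under the same lock.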


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CLD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@
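
For readers unfamiliar with waitid(2), a minimal userland sketch follows. It
uses only the POSIX interface (waitid(), P_PID, WEXITED, CLD_EXITED); the
helper name `exit_code_via_waitid()` is hypothetical and not part of this
commit.

```c
#define _XOPEN_SOURCE 700	/* ask for POSIX.1-2008 declarations */
#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, reap it with waitid(2), and
 * return the exit code reported in siginfo_t, or -1 on failure.
 */
int
exit_code_via_waitid(int code)
{
	siginfo_t si;
	pid_t pid;

	assert(code >= 0 && code <= 255);

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(code);		/* child */

	/* Wait for exactly this child to terminate. */
	if (waitid(P_PID, (id_t)pid, &si, WEXITED) == -1)
		return -1;
	if (si.si_pid != pid || si.si_code != CLD_EXITED)
		return -1;
	return si.si_status;		/* exit code for CLD_EXITED */
}
```

Unlike wait4(2), the result arrives as separate siginfo_t fields rather than
as a packed status word that has to be decoded with the W*() macros.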


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
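
The "run once, guarded by a flag" idea in this entry can be sketched as
follows. This is not the kernel code: `struct exit_model`,
`PS_EXITING_MODEL` and `exit_once()` are hypothetical, and C11 atomics are
used here in place of whatever primitives the kernel uses to set process
flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PS_EXITING_MODEL 0x1u	/* hypothetical stand-in for PS_EXITING */

struct exit_model {
	atomic_uint flags;
	int xexit;		/* exit status the parent will see */
};

/*
 * Record the exit status only for the first caller. Returns true if
 * this call performed the one-time "normal exit" work, false if the
 * process was already marked as exiting.
 */
bool
exit_once(struct exit_model *pr, int status)
{
	unsigned int old = atomic_fetch_or(&pr->flags, PS_EXITING_MODEL);

	if (old & PS_EXITING_MODEL)
		return false;	/* status already recorded, keep it */
	pr->xexit = status;
	return true;
}
```

Whichever thread sets the flag first decides the status; a later thread that
is torn down last finds the flag set and leaves the recorded status alone.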


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
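
As an illustration of what "a consolidated status" means, the sketch below
packs a separate exit code and signal number into one wait status word. The
helper `make_wait_status()` is hypothetical, and the layout (exit code in
bits 8-15, terminating signal in the low 7 bits) is the traditional encoding
assumed for this sketch, not something quoted from the commit.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: combine an exit code and a terminating signal
 * into the traditional wait status layout. The value 0x7f in the low
 * bits conventionally marks a stopped process, so it is rejected here.
 */
int
make_wait_status(int xexit, int xsig)
{
	assert(xsig >= 0 && xsig < 0x7f);
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

Keeping the two values apart internally and combining them only at the
reporting boundary lets an interface such as waitid(2), which reports them as
separate fields, use them directly.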


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
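
The WNOHANG behaviour described in this entry can be observed from userland
with a small test. This is a sketch, not the regression test for the commit:
the helper name and the sentinel value are made up, and only the "status is
left untouched" half of the fix is checked.

```c
#define _DEFAULT_SOURCE 1	/* for wait4() on glibc; harmless elsewhere */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <assert.h>
#include <unistd.h>

/*
 * Returns 0 if wait4(WNOHANG) returned 0 and left `status` untouched
 * while the child was still running, and the later blocking wait4()
 * reaped the child with exit code `code`. Returns -1 otherwise.
 */
int
wnohang_leaves_status_alone(int code)
{
	int fds[2], ok;
	int status = 0x5a5a;		/* sentinel, must survive */
	struct rusage ru;
	pid_t pid, r;
	char c;

	assert(code >= 0 && code <= 255);

	if (pipe(fds) == -1)
		return -1;
	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: block until the parent closes the write end. */
		close(fds[1]);
		if (read(fds[0], &c, 1) < 0)
			_exit(255);
		_exit(code);
	}
	close(fds[0]);

	/* Child is still blocked, so nothing is waitable yet. */
	r = wait4(pid, &status, WNOHANG, &ru);
	ok = (r == 0 && status == 0x5a5a);

	close(fds[1]);			/* child sees EOF and exits */
	r = wait4(pid, &status, 0, &ru);
	if (r != pid || !WIFEXITED(status) || WEXITSTATUS(status) != code)
		return -1;
	return ok ? 0 : -1;
}
```

The pipe keeps the child alive until the non-blocking call has been made, so
the check does not depend on scheduling luck.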


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
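
One simple way to keep "a small cache of recently exited pids" is a
fixed-size ring that is scanned before a pid is handed out again. The sketch
below illustrates that idea only; the names, the slot count and the layout
are assumptions and are not taken from the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define OLDPID_SLOTS 8		/* hypothetical cache size */

static pid_t oldpids[OLDPID_SLOTS];	/* zero-filled: 0 means "empty" */
static size_t oldpid_next;		/* total pids remembered so far */

/* Remember a pid that has just been released, evicting the oldest. */
void
remember_exited_pid(pid_t pid)
{
	assert(pid > 0);
	oldpids[oldpid_next++ % OLDPID_SLOTS] = pid;
}

/* Would reusing this pid collide with a recently exited process? */
bool
recently_exited(pid_t pid)
{
	for (size_t i = 0; i < OLDPID_SLOTS; i++)
		if (oldpids[i] == pid)
			return true;
	return false;
}
```

A pid allocator would call recently_exited() on each candidate and pick
another value on a hit, so a pid is not reused while stale references to it
are most likely to still exist.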


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
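
The ordering described in this entry (clear the reachable pointer first, then
release the object) looks like this in a user-space model. All names here are
hypothetical; `release_model()` merely stands in for a release routine that
may sleep before it returns.

```c
#include <assert.h>
#include <stddef.h>

struct vnode_model {
	int refcnt;
};

struct owner_model {
	struct vnode_model *textvp;	/* pointer others may follow */
};

/* Stand-in for a release routine that may sleep before returning. */
static void
release_model(struct vnode_model *vp)
{
	assert(vp->refcnt > 0);
	vp->refcnt--;
}

/* Unhook the pointer first, then drop the reference. */
void
drop_textvp(struct owner_model *o)
{
	struct vnode_model *vp = o->textvp;

	o->textvp = NULL;	/* nobody can reach vp through o now */
	if (vp != NULL)
		release_model(vp);
}
```

If the release sleeps, anything that looks at the owner in the meantime finds
NULL instead of a pointer to an object that is partway through being freed.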


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they have. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
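
The bit trick described in this entry can be sketched in userland as
follows. This is an illustration only: the DEMO_ macros and demo_
functions are hypothetical stand-ins rather than the real <sys/wait.h>
definitions, and the octal values are assumptions chosen to satisfy
the relations stated in the commit message.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for the <sys/wait.h> macros. The values are
 * assumptions picked so that the relations in the commit message hold.
 */
#define DEMO_WSTATUS(x)   ((x) & 0177)  /* low 7 bits of the status */
#define DEMO_WSTOPPED     0177          /* DEMO_WSTATUS value when stopped */
#define DEMO_WCONTINUED   0177777       /* low 16 bits set: continued */

/* Compare the whole low byte, so a continued status (0xff) does not match. */
static bool
demo_wifstopped(int status)
{
	return (status & 0xff) == DEMO_WSTOPPED;
}

/* A continued status has all 16 low bits set. */
static bool
demo_wifcontinued(int status)
{
	return (status & DEMO_WCONTINUED) == DEMO_WCONTINUED;
}

/*
 * Needs no extra check: DEMO_WSTATUS(DEMO_WCONTINUED) equals
 * DEMO_WSTOPPED, so a continued status is rejected here already.
 */
static bool
demo_wifsignaled(int status)
{
	return DEMO_WSTATUS(status) != DEMO_WSTOPPED &&
	    DEMO_WSTATUS(status) != 0;
}
```

With these definitions a continued status is reported only by
demo_wifcontinued(), while a status of the form (sig << 8) | 0177 is
still reported as stopped.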


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.215 13-Sep-2023 claudio

Revert commitid: yfAefyNWibUyjkU2, ESyyH5EKxtrXGkS6 and itscfpFvJLOj8mHB;

The change to the single thread API results in crashes inside exit1()
as found by Syzkaller. There seems to be a race in the exit codepath.
What exactly fails is not really clear, therefore revert for now.

This should fix the following Syzkaller reports:
Reported-by: syzbot+38efb425eada701ca8bb@syzkaller.appspotmail.com
Reported-by: syzbot+ecc0e8628b3db39b5b17@syzkaller.appspotmail.com
and maybe more.

Reverted commits:


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@
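
The count-up scheme described in this entry can be modelled in
userland roughly as below. This is a sketch under stated assumptions:
struct demo_single and the demo_ functions are hypothetical, the
fields only echo ps_threadcnt and ps_singlecnt, and the process mutex
that the commit says protects them is left out.

```c
#include <stdbool.h>

/* Hypothetical userland model; not the kernel's struct process. */
struct demo_single {
	int threadcnt;	/* threads that must park (models ps_threadcnt) */
	int singlecnt;	/* threads parked so far (models ps_singlecnt) */
};

/* Enter single-thread mode: nothing is parked yet. */
static void
demo_single_start(struct demo_single *pr, int nthreads)
{
	pr->threadcnt = nthreads;
	pr->singlecnt = 0;
}

/*
 * Called as each thread parks itself. In the kernel this update would
 * be done with the process mutex held; the model has no locking.
 */
static void
demo_thread_parked(struct demo_single *pr)
{
	pr->singlecnt++;
}

/* All threads are asleep once the count has climbed to threadcnt. */
static bool
demo_all_parked(const struct demo_single *pr)
{
	return pr->singlecnt == pr->threadcnt;
}
```

Counting up towards ps_threadcnt, rather than down to zero, means the
completion test compares two fields that the same lock covers.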


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CLD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
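
The run-once guard described in this entry can be sketched as follows.
This is a minimal userland illustration, not the kernel code:
DEMO_PS_EXITING, struct demo_exit_process and demo_exit_normal() are
hypothetical names, and the locking or atomic flag update a kernel
would need is omitted.

```c
#include <stdbool.h>

#define DEMO_PS_EXITING	0x01	/* hypothetical flag bit */

/* Hypothetical model of the per-process exit state. */
struct demo_exit_process {
	unsigned int flags;	/* holds DEMO_PS_EXITING once exit began */
	int xexit;		/* exit status recorded by the first caller */
};

/*
 * Record the exit status only on the first call. Later calls, such as
 * the last thread being torn down, leave the stored status untouched.
 * Returns true if this call recorded the status.
 */
static bool
demo_exit_normal(struct demo_exit_process *pr, int status)
{
	if (pr->flags & DEMO_PS_EXITING)
		return false;
	pr->flags |= DEMO_PS_EXITING;
	pr->xexit = status;
	return true;
}
```

A second call with a different status returns false and leaves xexit
as the first caller set it, which is the clobbering the commit
prevents.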


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
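
As a rough illustration of the "consolidated status" mentioned in this entry, the sketch below packs a separately stored exit code and signal into one wait(2)-style status word. The helper name consolidate_status() and the bit layout (exit code in bits 8..15, signal in the low 7 bits) are assumptions following the traditional BSD encoding, not code taken from the kernel.

```c
#include <assert.h>

/*
 * Hypothetical helper: combine an exit code and a terminating signal
 * (0 if the process exited normally) into a traditional wait status.
 * Assumed layout: exit code in bits 8..15, signal in bits 0..6.
 */
int
consolidate_status(int xexit, int xsig)
{
	/* 0x7f is reserved for "stopped", so a real signal is below it. */
	assert(xsig >= 0 && xsig < 0x7f);
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

Keeping the two values apart until a caller needs the packed form avoids repeated encoding and decoding inside the exit path.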


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@
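
The copy-on-write idea described in this entry can be sketched in user space roughly as follows. The struct and function names (xlimit, xlimit_share, xlimit_writable) are hypothetical, and the sketch is single-threaded: it shows only the "copy when shared" step, not the locking or the nearly lock-free read path of the real change.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified resource limit object. */
struct xlimit {
	int refcnt;	/* number of holders sharing this object */
	long cur;	/* a single limit value, for illustration */
};

/* Take another reference to a shared, read-only limit object. */
struct xlimit *
xlimit_share(struct xlimit *l)
{
	l->refcnt++;
	return l;
}

/*
 * Return an object the caller may modify.  If the object is shared,
 * copy it and drop the caller's reference to the original.
 * Returns NULL if the copy cannot be allocated.
 */
struct xlimit *
xlimit_writable(struct xlimit *l)
{
	struct xlimit *n;

	assert(l->refcnt > 0);
	if (l->refcnt == 1)
		return l;
	n = malloc(sizeof(*n));
	if (n == NULL)
		return NULL;
	*n = *l;
	n->refcnt = 1;
	l->refcnt--;
	return n;
}
```

Readers that only hold a reference never see a half-updated object, because writers always work on a private copy.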


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes which include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
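
A "small cache of recently exited pids" could look roughly like the ring buffer below. This is only a sketch under assumptions: the names (oldpids, oldpids_record, oldpids_contains) and the cache size are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>

#define OLDPIDS_MAX 4	/* hypothetical cache size */

/* Ring buffer of recently exited pids; 0 marks an unused slot. */
struct oldpids {
	int pids[OLDPIDS_MAX];
	int next;		/* slot to overwrite next */
};

/* Remember a pid that has just exited, evicting the oldest entry. */
void
oldpids_record(struct oldpids *c, int pid)
{
	assert(pid > 0);
	c->pids[c->next] = pid;
	c->next = (c->next + 1) % OLDPIDS_MAX;
}

/* Return 1 if pid exited recently and should not be reused yet. */
int
oldpids_contains(const struct oldpids *c, int pid)
{
	int i;

	for (i = 0; i < OLDPIDS_MAX; i++)
		if (c->pids[i] == pid)
			return 1;
	return 0;
}
```

A pid allocator built on this would retry with another candidate while oldpids_contains() reports a hit.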


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
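
The ordering described in this entry (erase reachable pointers first, then release) can be sketched in user space as below. The types and functions (obj, owner, obj_release, owner_drop) are hypothetical stand-ins for the vnode, its holder and vrele(); the real release may sleep, which this mock does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted vnode. */
struct obj {
	int refcnt;
};

/* Hypothetical holder whose pointer other code can reach. */
struct owner {
	struct obj *o_ref;
};

/* Stand-in for a release routine that may sleep at the last reference. */
void
obj_release(struct obj *o)
{
	assert(o->refcnt > 0);
	o->refcnt--;
}

/*
 * Clear the reachable pointer first, then drop the reference through
 * a local copy, so nothing can find the object during the release.
 */
void
owner_drop(struct owner *ow)
{
	struct obj *o = ow->o_ref;

	if (o == NULL)
		return;
	ow->o_ref = NULL;
	obj_release(o);
}
```

Calling owner_drop() twice is harmless, since the second call sees a NULL pointer and returns early.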


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
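
The reference-counting idea in this entry can be illustrated with the minimal sketch below. The struct and function names (xprocess, xprocess_rele) are hypothetical, and a flag stands in for the actual freeing so the behaviour is easy to observe; it is not the kernel's implementation.

```c
#include <assert.h>

/* Hypothetical, simplified process shared by several threads. */
struct xprocess {
	int refcnt;	/* one reference per thread not yet reaped */
	int freed;	/* set where the real code would free the struct */
};

/*
 * Drop one thread's reference.  Only the last thread to arrive tears
 * the process down, whichever order the threads reach the reaper in.
 * Returns 1 if this call released the process, 0 otherwise.
 */
int
xprocess_rele(struct xprocess *pr)
{
	assert(pr->refcnt > 0);
	if (--pr->refcnt > 0)
		return 0;
	pr->freed = 1;
	return 1;
}
```

With this scheme the main thread reaching the reaper first no longer frees state that another thread still uses.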


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
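
The idea of updating a flag word only through atomic read-modify-write operations can be sketched in userland with C11 atomics. The helpers below only mimic the spirit of atomic_{set,clear}bits_int; the names setbits(), clearbits() and the X_FLAG_* values are assumptions made for this example.

```c
#include <stdatomic.h>

/* Hypothetical flag values, for illustration only. */
#define X_FLAG_A	0x01u
#define X_FLAG_B	0x02u

/* Atomically OR `bits' into the flag word. */
static void
setbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Atomically clear `bits' in the flag word. */
static void
clearbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}
```

Because each update is a single atomic operation, a concurrent update of a different bit cannot be lost the way it could with a plain `flags |= bit`.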


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all that's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

By popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.214 08-Sep-2023 claudio

Change how ps_threads and p_thr_link are locked away from using SCHED_LOCK.

The per process thread list can be traversed (read) by holding either
the KERNEL_LOCK or the per process ps_mtx (instead of SCHED_LOCK).
Abusing the SCHED_LOCK for this makes it impossible to split up the
scheduler lock into something more fine grained.

Tested by phessler@, ok mpi@


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@
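
A toy model of the count-up scheme described above: each thread that parks increments a counter under the process mutex, and single-thread mode is reached once the counter equals the number of threads expected to park. The struct, its fields and toy_thread_parked() are assumptions for illustration; the kernel's actual accounting (for example, whether the requesting thread is counted) may differ.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant process fields. */
struct toy_process {
	pthread_mutex_t	mtx;		/* protects the two counters */
	int		threadcnt;	/* threads expected to park */
	int		singlecnt;	/* threads parked so far */
};

/*
 * Record that one more thread has parked.  Returns true when every
 * expected thread is now parked.
 */
static bool
toy_thread_parked(struct toy_process *pr)
{
	bool all;

	pthread_mutex_lock(&pr->mtx);
	pr->singlecnt++;
	all = (pr->singlecnt == pr->threadcnt);
	pthread_mutex_unlock(&pr->mtx);
	return all;
}
```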


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CLD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_threads' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
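
One common way to make a step run only once is to test and set a flag in a single atomic operation, as sketched below. This is a userland analogy under stated assumptions: TOY_EXITING stands in for PS_EXITING and toy_claim_exit() is hypothetical; the kernel's actual check may be structured differently.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_EXITING	0x01u	/* hypothetical stand-in for PS_EXITING */

/*
 * Set the exiting flag and report whether this caller was the first
 * to do so.  Only the first caller gets true and should perform the
 * one-time exit work.
 */
static bool
toy_claim_exit(atomic_uint *flags)
{
	unsigned int old = atomic_fetch_or(flags, TOY_EXITING);

	return (old & TOY_EXITING) == 0;
}
```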


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@
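
A userland analogue of such a helper is sketched below: cancelling a timer amounts to installing a zeroed itimerval, and the helper keeps that detail away from callers. This uses the setitimer(2) system call rather than the kernel-internal function the commit refers to, and cancel_timer() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/*
 * Cancel one interval timer by installing a zeroed itimerval, so that
 * callers do not need to fuss with the struct themselves.
 * Returns 0 on success, -1 on error (as setitimer(2) does).
 */
static int
cancel_timer(int which)
{
	struct itimerval itv;

	memset(&itv, 0, sizeof(itv));
	return setitimer(which, &itv, NULL);
}
```

Cancelling all timers, as the later cancel_all_itimers rename suggests, would then be a loop over ITIMER_REAL, ITIMER_VIRTUAL and ITIMER_PROF.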


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
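
Combining a separately stored exit code and signal into one wait status can be sketched as below. The layout is an assumption (the traditional BSD encoding: exit code in bits 8-15, terminating signal in the low 7 bits), and make_wait_status() is a hypothetical helper, not the kernel's macro.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Build a wait(2)-style status from an exit code and a signal number.
 * Assumed layout: exit code in bits 8-15, signal in the low 7 bits.
 */
static int
make_wait_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

On systems using this encoding, WEXITSTATUS(make_wait_status(3, 0)) yields the exit code and WTERMSIG(make_wait_status(0, SIGKILL)) yields the signal.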


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
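
The WNOHANG behaviour described in 1.143 can be observed from userland. The following is a minimal sketch, not code from the tree: the function name and the SENTINEL value are made up for illustration, and it assumes a system that provides wait4(2). It polls a still-running child and checks that wait4() returns 0 without writing to status.

```c
#define _DEFAULT_SOURCE		/* glibc needs this for wait4(); harmless elsewhere */
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <assert.h>
#include <signal.h>
#include <unistd.h>

#define SENTINEL 0x5a5a5a5a	/* hypothetical marker value for this sketch */

/*
 * Fork a child that blocks in pause(), poll it with WNOHANG, and
 * report whether wait4() returned 0 and left `status` untouched.
 * Returns 1 for the behaviour the commit describes, 0 otherwise.
 */
int
wnohang_leaves_status_alone(void)
{
	struct rusage ru;
	int status = SENTINEL;
	pid_t child, ret;

	child = fork();
	assert(child != -1);
	if (child == 0) {
		pause();	/* stay alive until the parent kills us */
		_exit(0);
	}

	/* The child has not terminated, so WNOHANG returns 0 at once. */
	ret = wait4(child, &status, WNOHANG, &ru);

	/* Clean up: terminate and reap the child. */
	kill(child, SIGKILL);
	(void)waitpid(child, NULL, 0);

	return ret == 0 && status == SENTINEL;
}
```

A caller would treat a return of 1 as "status was not modified"; the rusage half of the fix (zeroed bytes instead of stack garbage) is not checked here.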


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
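
The macro behaviour described in 1.47 can be exercised from userland with waitpid(2). The following is a minimal sketch, assuming a system that implements WCONTINUED and WIFCONTINUED per 1003.1-2001; the function name is hypothetical and nothing here is taken from the kernel source. It stops a child, continues it, and checks that the continue is reported through WIFCONTINUED without also matching WIFSTOPPED.

```c
#define _DEFAULT_SOURCE		/* glibc feature macro; harmless elsewhere */
#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/*
 * Stop and then continue a child, collecting both state changes.
 * Returns 1 if the stop is reported via WIFSTOPPED and the continue
 * via WIFCONTINUED (and not WIFSTOPPED), 0 otherwise.
 */
int
observe_stop_and_continue(void)
{
	int status = 0, saw_stop = 0, saw_cont = 0;
	pid_t child;

	child = fork();
	assert(child != -1);
	if (child == 0) {
		for (;;)
			pause();	/* idle until killed */
	}

	/* Stop the child and wait for the stop to be reported. */
	kill(child, SIGSTOP);
	if (waitpid(child, &status, WUNTRACED) == child)
		saw_stop = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;

	/* Resume it and wait for the continue to be reported. */
	kill(child, SIGCONT);
	if (waitpid(child, &status, WCONTINUED) == child)
		saw_cont = WIFCONTINUED(status) && !WIFSTOPPED(status);

	/* Clean up: terminate and reap the child. */
	kill(child, SIGKILL);
	(void)waitpid(child, NULL, 0);

	return saw_stop && saw_cont;
}
```

Both waitpid() calls block until the requested state change has happened, so the sketch does not depend on timing.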


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
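
The clipping rule described in the "nice bug" paragraph can be sketched in
userland C. This is a minimal illustration, not the kernel code: the helper
names are hypothetical, and the constant values (PRIO_MAX of 20, NICE_WEIGHT
of 2, PPQ of 4) are assumptions chosen to match the message's description.

```c
#include <assert.h>

/* Illustrative values only; assumptions, not taken from the kernel tree. */
#define PRIO_MAX     20   /* largest nice value */
#define NICE_WEIGHT  2    /* priority points per nice level */
#define PPQ          4    /* priorities per run queue (assumed) */
#define ESTCPU_MAX   (NICE_WEIGHT * PRIO_MAX - PPQ)

/*
 * Clip the decayed CPU estimate so that a compute-bound nice +0 process
 * cannot penalize itself onto the queue used by nice +20 processes.
 */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu > ESTCPU_MAX ? ESTCPU_MAX : estcpu;
}

/*
 * Priority offset with the nice value entering exactly once, weighted by
 * NICE_WEIGHT. A larger result means a less favored process.
 */
static unsigned int
user_pri_offset(unsigned int estcpu, unsigned int nice)
{
	return estcpu_clip(estcpu) + NICE_WEIGHT * nice;
}
```

With these assumed values, a fully CPU-bound nice +0 process lands at offset
36 (queue 9) while an idle nice +20 process sits at offset 40 (queue 10), so
the two never share a queue.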


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.213 04-Sep-2023 claudio

Protect ps_single, ps_singlecnt and ps_threadcnt by the process mutex.

The single thread API needs to lock the process to enter single thread
mode and does not need to stop the scheduler.

This code changes ps_singlecount from a count down to zero to ps_singlecnt
which counts up until equal to ps_threadcnt (in which case all threads
are properly asleep).

Tested by phessler@, OK mpi@ cheloha@
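
The count-up scheme can be modelled in a few lines of userland C. This is
only a sketch under stated assumptions: struct proc_model and its functions
are hypothetical, a pthread mutex stands in for the process mutex, and the
sleep/wakeup side of the real API is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant struct process fields. */
struct proc_model {
	pthread_mutex_t mtx;	/* plays the role of the process mutex */
	int threadcnt;		/* threads that have to park */
	int singlecnt;		/* threads parked so far */
};

static void
model_init(struct proc_model *pr, int nthreads)
{
	pthread_mutex_init(&pr->mtx, NULL);
	pr->threadcnt = nthreads;
	pr->singlecnt = 0;
}

/*
 * Called by a sibling when it parks. Returns true once the parked count
 * has reached the thread count, the point at which the thread that
 * requested single-thread mode could be woken.
 */
static bool
model_park(struct proc_model *pr)
{
	bool all_parked;

	pthread_mutex_lock(&pr->mtx);
	pr->singlecnt++;
	all_parked = (pr->singlecnt == pr->threadcnt);
	pthread_mutex_unlock(&pr->mtx);
	return all_parked;
}
```

Both counters are only read and written with the mutex held, which is the
property the commit message describes.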


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CMD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
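
The run-once guard can be illustrated with a small model. The names below
(struct exit_model, MODEL_PS_EXITING, model_exit_normal()) are hypothetical
and the sketch omits the locking or atomic flag updates the kernel would
need; it only shows why the first exit status survives.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PS_EXITING 0x1	/* hypothetical flag bit */

struct exit_model {
	unsigned int flags;
	int xexit;		/* recorded exit status */
};

/*
 * Record the exit status only on the first call. Later callers, such as
 * the last thread being torn down, see the flag and leave the status
 * untouched. Returns true if this call did the recording.
 */
static bool
model_exit_normal(struct exit_model *pr, int status)
{
	if (pr->flags & MODEL_PS_EXITING)
		return false;
	pr->flags |= MODEL_PS_EXITING;
	pr->xexit = status;
	return true;
}
```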


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_set() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it is awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehands if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
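
The status half of this fix can be checked from userland. The sketch below
uses standard POSIX calls; the function name wnohang_leaves_status() and the
sentinel value are hypothetical, and the result depends on the running
kernel following the behavior described above.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that stays alive, poll it with WNOHANG, and check that
 * the call returned 0 without touching `status`. Returns 1 if so, 0 if
 * not, and -1 if fork() fails.
 */
static int
wnohang_leaves_status(void)
{
	const int sentinel = 0x5a5a;
	int status = sentinel;
	pid_t pid, r;
	int ok;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: block until killed by the parent. */
		for (;;)
			pause();
	}

	r = waitpid(pid, &status, WNOHANG);
	ok = (r == 0 && status == sentinel);

	/* Clean up: terminate and reap the child. */
	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);
	return ok;
}
```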


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
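
One way to picture "a small cache of recently exited pids" is a fixed-size
ring buffer that the allocator consults before handing out a pid. The sketch
below is an assumption about the general shape only; the names, the cache
size, and the ring-buffer layout are hypothetical and not taken from the
commit.

```c
#include <assert.h>
#include <stdbool.h>

#define RECENT_PIDS 8	/* hypothetical cache size */

struct pid_cache {
	int pids[RECENT_PIDS];
	int next;		/* slot to overwrite next */
};

static void
pid_cache_init(struct pid_cache *c)
{
	int i;

	for (i = 0; i < RECENT_PIDS; i++)
		c->pids[i] = -1;	/* -1 marks an empty slot */
	c->next = 0;
}

/* Remember a pid that has just exited, evicting the oldest entry. */
static void
pid_cache_add(struct pid_cache *c, int pid)
{
	c->pids[c->next] = pid;
	c->next = (c->next + 1) % RECENT_PIDS;
}

/* A pid allocator would skip any candidate for which this is true. */
static bool
pid_cache_contains(const struct pid_cache *c, int pid)
{
	int i;

	for (i = 0; i < RECENT_PIDS; i++)
		if (c->pids[i] == pid)
			return true;
	return false;
}
```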


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
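
The ordering described here (unhook first, release second) can be shown with
a toy model. The types and functions below are hypothetical stand-ins, not
the kernel's vnode API; model_release() merely marks the place where the
real release could sleep.

```c
#include <assert.h>
#include <stddef.h>

struct vnode_model {
	int refcnt;
};

struct owner_model {
	struct vnode_model *textvp;	/* pointer others can reach */
};

/* Stand-in for a release routine that may sleep when the count hits 0. */
static void
model_release(struct vnode_model *vp)
{
	vp->refcnt--;
}

/*
 * Erase the reachable pointer before releasing, so nothing can find the
 * vnode through the owner while the release is in progress.
 */
static void
owner_drop_text(struct owner_model *o)
{
	struct vnode_model *vp = o->textvp;

	o->textvp = NULL;
	if (vp != NULL)
		model_release(vp);
}
```

Calling owner_drop_text() a second time is harmless in this model, because
the pointer was already cleared by the first call.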


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.
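
The scan described above can be sketched in userland roughly as follows. This is only an illustration, not the librthread code: struct dead_thread, gc_add(), and gc_reap() are hypothetical names, and the caller zeroing the tid field stands in for the kernel doing so.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical GC-list entry for a thread that has started exiting. */
struct dead_thread {
	volatile pid_t tid;		/* zeroed once the stack is unused */
	void *stack;
	struct dead_thread *next;
};

/* Put an exiting thread on the GC list; returns the entry or NULL. */
struct dead_thread *
gc_add(struct dead_thread **head, pid_t tid, size_t stacksize)
{
	struct dead_thread *dt = malloc(sizeof(*dt));

	if (dt == NULL)
		return NULL;
	if ((dt->stack = malloc(stacksize)) == NULL) {
		free(dt);
		return NULL;
	}
	dt->tid = tid;
	dt->next = *head;
	*head = dt;
	return dt;
}

/* Free every entry whose tid has been zeroed; returns how many were freed. */
int
gc_reap(struct dead_thread **head)
{
	struct dead_thread **dtp = head, *dt;
	int nfreed = 0;

	while ((dt = *dtp) != NULL) {
		if (dt->tid != 0) {
			dtp = &dt->next;	/* still in use, keep it */
			continue;
		}
		*dtp = dt->next;		/* unlink, then free */
		free(dt->stack);
		free(dt);
		nfreed++;
	}
	return nfreed;
}
```

In this sketch gc_reap() would be called from the thread-creation and join paths, which is where the message says the scan happens.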

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
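
A minimal single-threaded sketch of the reference-counting idea, assuming some outer lock serializes the callers. struct shared_state and the shared_*() helpers are hypothetical names, not the kernel's:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for state shared by all threads of a process. */
struct shared_state {
	int refcnt;		/* one reference per thread still using it */
};

/* Allocate the shared state with one reference held by the creator. */
struct shared_state *
shared_alloc(void)
{
	struct shared_state *ss = calloc(1, sizeof(*ss));

	if (ss != NULL)
		ss->refcnt = 1;
	return ss;
}

/* Take an extra reference, e.g. when another thread is created. */
void
shared_hold(struct shared_state *ss)
{
	ss->refcnt++;
}

/*
 * Drop one reference.  Only the last holder frees the memory, so the
 * order in which threads reach this point does not matter.
 * Returns 1 if the state was freed, 0 otherwise.
 */
int
shared_release(struct shared_state *ss)
{
	if (--ss->refcnt > 0)
		return 0;
	free(ss);
	return 1;
}
```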


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
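
For readers unfamiliar with the pattern, the difference between a plain `flags |= bit` and an atomic update can be sketched with C11 atomics. This is a userland analogue only: setbits_uint(), clearbits_uint(), and the FLAG_* values are made up for the example and are not the kernel's atomic.h interface.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag values, for illustration only. */
#define FLAG_A	0x0001u
#define FLAG_B	0x0002u

/*
 * Atomically OR bits into *flags.  Unlike "*flags |= bits", the
 * read-modify-write cannot be torn by a concurrent updater.
 */
static inline void
setbits_uint(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits in *flags. */
static inline void
clearbits_uint(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```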


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they have. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
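
From userland, the new flag and macro can be exercised along these lines. This is a sketch, assuming a POSIX.1-2001 or later system; observe_stop_continue() is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 700	/* expose WCONTINUED on strict toolchains */
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stop, continue, and then kill a child, recording what waitpid()
 * reports after each step.  seen[0], seen[1], and seen[2] are set to 1
 * when the stopped, continued, and killed states were observed.
 * Returns 0 on success, -1 if fork() or waitpid() failed.
 */
int
observe_stop_continue(int seen[3])
{
	int status, rv = -1;
	pid_t pid;

	seen[0] = seen[1] = seen[2] = 0;
	if ((pid = fork()) == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: idle until signalled */
	}

	kill(pid, SIGSTOP);
	if (waitpid(pid, &status, WUNTRACED) != pid)
		goto out;
	seen[0] = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;

	kill(pid, SIGCONT);
	if (waitpid(pid, &status, WCONTINUED) != pid)
		goto out;
	/* A continued status must not also look stopped. */
	seen[1] = WIFCONTINUED(status) && !WIFSTOPPED(status);
	rv = 0;
out:
	kill(pid, SIGKILL);	/* always clean up the child */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	seen[2] = WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
	return rv;
}
```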


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
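
The arithmetic in the "nice bug" and "algorithm change" paragraphs can be checked with a small sketch. It models only the "first glance" reading, where nice is added into p_estcpu once. The function names are hypothetical, and the sample values (nice weight 2, PRIO_MAX 20, PPQ 4) are assumptions for illustration; only the weight of two is stated in the message.

```c
#include <assert.h>

/* Priority contribution as quoted in the message: p_estcpu/4 + nice*2. */
unsigned int
quoted_pri(unsigned int estcpu, unsigned int nice)
{
	return estcpu / 4 + nice * 2;
}

/*
 * The same formula when nice has also been added into p_estcpu once.
 * The extra term is nice/4, i.e. 1/8 of the direct nice*2 term.
 */
unsigned int
quoted_pri_nice_in_estcpu(unsigned int estcpu, unsigned int nice)
{
	return (estcpu + nice) / 4 + nice * 2;
}

/* Upper bound for p_estcpu named above: NICE_WEIGHT * PRIO_MAX - PPQ. */
unsigned int
estcpu_limit(unsigned int nice_weight, unsigned int prio_max,
    unsigned int ppq)
{
	return nice_weight * prio_max - ppq;
}
```

With nice 8 and no CPU history, the direct term is 16 and the extra term is 2, which is the 1/8 ratio the message mentions. With the assumed constants the clip limit works out to 36.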


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.212 29-Aug-2023 claudio

Remove p_rtime from struct proc and replace it by passing the timespec
as argument to the tuagg_locked function.

- Remove incorrect use of p_rtime in other parts of the tree. p_rtime was
almost always 0 so including it in any sum did not alter the result.
- In main() the update of time can be further simplified since at that time
only the primary cpu is running.
- Add missing nanouptime() call in cpu_hatch() for hppa
- Rename tuagg_unlocked to tuagg_locked like it is done in the rest of
the tree.

OK cheloha@ dlg@


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CLD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_r which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@
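
A minimal userland sketch of the new call, assuming a POSIX.1-2008 system; child_exit_status() is a hypothetical helper, not part of this commit:

```c
#define _XOPEN_SOURCE 700	/* expose waitid() on strict toolchains */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code' and collect it with waitid(2).
 * Returns the exit status reported in siginfo_t, or -1 on any failure.
 */
int
child_exit_status(int code)
{
	siginfo_t info;
	pid_t pid;

	if ((pid = fork()) == -1)
		return -1;
	if (pid == 0)
		_exit(code);

	memset(&info, 0, sizeof(info));
	if (waitid(P_PID, (id_t)pid, &info, WEXITED) == -1)
		return -1;
	/* Unlike wait4(), the result arrives as separate fields. */
	if (info.si_pid != pid || info.si_code != CLD_EXITED)
		return -1;
	return info.si_status;
}
```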


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))
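
As a sketch of the "consolidated status" conversion, assuming the traditional layout (exit code in bits 8-15, signal number in the low 7 bits). make_wait_status() is a hypothetical name; the kernel may use a different macro or helper.

```c
#include <sys/wait.h>

/*
 * Hypothetical helper: fold a separately stored exit code and
 * terminating signal into a wait(2)-style status word.
 * Assumed layout: exit code in bits 8-15, signal in bits 0-6.
 */
int
make_wait_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

The result can be inspected with the usual WIFEXITED(), WEXITSTATUS(), WIFSIGNALED() and WTERMSIG() macros.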

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
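
One way such a cache could look, as a rough sketch only: a small ring of recently exited pids that an allocator consults before handing out a number. RECENT_PIDS, struct pid_cache and both functions are hypothetical; the size and policy of the real cache are not stated in this log.

```c
#include <sys/types.h>

#define RECENT_PIDS 8	/* assumed size, chosen for illustration */

/* Ring buffer of recently exited pids; zero-initialize before use. */
struct pid_cache {
	pid_t		slot[RECENT_PIDS];
	unsigned int	next;
};

/* Record a pid that just exited, overwriting the oldest entry. */
void
pid_cache_remember(struct pid_cache *pc, pid_t pid)
{
	pc->slot[pc->next % RECENT_PIDS] = pid;
	pc->next++;
}

/* Nonzero if pid (assumed > 0) exited recently and should not be reused. */
int
pid_cache_holds(const struct pid_cache *pc, pid_t pid)
{
	unsigned int i;

	for (i = 0; i < RECENT_PIDS; i++) {
		if (pc->slot[i] == pid)
			return 1;
	}
	return 0;
}
```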
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
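
The ordering described above, reduced to a generic sketch. struct object, struct holder, object_release() and holder_drop() are hypothetical stand-ins, not vnode code; the point is only that the pointer is cleared before the release routine runs, so a release that sleeps cannot expose a half-freed object through the holder.

```c
#include <stddef.h>

/* Hypothetical reference-counted object and its holder. */
struct object {
	int refs;
};

struct holder {
	struct object *obj;
};

/* Stand-in for a release routine that might sleep in a real kernel. */
void
object_release(struct object *o)
{
	o->refs--;
}

/*
 * Unpublish first, release second: once release() runs, nobody
 * can reach the object through the holder any more.
 */
void
holder_drop(struct holder *h, void (*release)(struct object *))
{
	struct object *o = h->obj;

	if (o == NULL)
		return;
	h->obj = NULL;
	release(o);
}
```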
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.
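
A rough userland model of the GC list described above. struct dead_thread, gc_defer() and gc_scan() are hypothetical names and this is not librthread's code; in the real design the kernel zeroes the thread id, here the caller does.

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical per-thread record parked on the GC list. */
struct dead_thread {
	volatile pid_t		 tid;	/* zeroed once the thread is fully gone */
	void			*stack;	/* heap-allocated in this sketch */
	struct dead_thread	*next;
};

/* A dying thread pushes its own record onto the GC list. */
void
gc_defer(struct dead_thread **head, struct dead_thread *t)
{
	t->next = *head;
	*head = t;
}

/*
 * Called by surviving threads: free every record whose tid has
 * been zeroed, leave the rest queued. Returns the number freed.
 */
int
gc_scan(struct dead_thread **head)
{
	struct dead_thread **pp = head, *t;
	int freed = 0;

	while ((t = *pp) != NULL) {
		if (t->tid == 0) {
			*pp = t->next;	/* unlink before freeing */
			free(t->stack);
			free(t);
			freed++;
		} else {
			pp = &t->next;
		}
	}
	return freed;
}
```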

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.
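
To make the encoding concrete, a small sketch. The constant value and all names below are assumptions for illustration; only the rule "tid of the original thread = pid + THREAD_PID_OFFSET" comes from the text above.

```c
#include <sys/types.h>

/* Assumed value for illustration; the kernel's constant may differ. */
#define EX_THREAD_PID_OFFSET	1000000

/* tid of a process's original thread, per the rule quoted above. */
pid_t
main_thread_tid(pid_t pid)
{
	return pid + EX_THREAD_PID_OFFSET;
}

/* Nonzero if an id passed to a kill(2)-like call names a thread. */
int
id_names_thread(pid_t id)
{
	return id >= EX_THREAD_PID_OFFSET;
}

/* Recover the pid from the original thread's tid. */
pid_t
main_tid_to_pid(pid_t tid)
{
	return tid - EX_THREAD_PID_OFFSET;
}
```

This only works while every valid pid stays below the offset, which is an assumption of the sketch.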

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.
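
A generic reference-count sketch of the idea, not the kernel's struct process. The names are hypothetical and the code assumes the callers are serialized by some outer lock, so plain integer arithmetic is enough.

```c
#include <stdlib.h>

/* Hypothetical state shared by all threads of one process. */
struct shared_state {
	int refcnt;
};

/* Create the shared state holding one reference for the creator. */
struct shared_state *
state_create(void)
{
	struct shared_state *s = calloc(1, sizeof(*s));

	if (s != NULL)
		s->refcnt = 1;
	return s;
}

/* Each additional thread takes its own reference. */
void
state_hold(struct shared_state *s)
{
	s->refcnt++;
}

/*
 * Drop one reference. The state is freed only by whoever drops
 * the last one, regardless of the order in which threads arrive.
 * Returns 1 if the state was freed, 0 otherwise.
 */
int
state_release(struct shared_state *s)
{
	if (--s->refcnt > 0)
		return 0;
	free(s);
	return 1;
}
```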

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.211 25-Apr-2023 claudio

Rename ps_refcnt to ps_threadcnt in struct process and implement
P_HASSIBLING() using this count.
OK mvs@ mpi@


Revision tags: OPENBSD_7_3_BASE
# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CLD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
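
The user-visible side of this fix can be sketched as follows. This is a minimal illustration, not code from the tree: wnohang_demo() and STATUS_SENTINEL are hypothetical names, and the pipe is only there to keep the child alive while the parent polls.

```c
#define _DEFAULT_SOURCE 1	/* exposes wait4() on glibc; assumed harmless elsewhere */
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* hypothetical marker value */

/*
 * Hypothetical helper: poll a still-running child with WNOHANG, then
 * release and reap it.  Returns the result of the WNOHANG poll,
 * stores the status word as it was left by that poll, and stores the
 * child's exit code (or -1 if it could not be reaped).
 */
static pid_t
wnohang_demo(int *polled_status, int *exit_code)
{
	struct rusage ru;
	int gate[2], status;
	pid_t pid, polled;
	char c;

	if (pipe(gate) == -1)
		return -1;
	pid = fork();
	if (pid == -1) {
		close(gate[0]);
		close(gate[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: block until the parent closes its end of the pipe. */
		close(gate[1]);
		while (read(gate[0], &c, 1) > 0)
			continue;
		_exit(7);
	}
	close(gate[0]);

	/* The child is still running, so this poll has nothing to report. */
	status = STATUS_SENTINEL;
	memset(&ru, 0, sizeof(ru));
	polled = wait4(pid, &status, WNOHANG, &ru);
	*polled_status = status;

	/* Release the child and reap it for real. */
	close(gate[1]);
	*exit_code = -1;
	if (wait4(pid, &status, 0, &ru) == pid && WIFEXITED(status))
		*exit_code = WEXITSTATUS(status);
	return polled;
}
```

The poll is expected to return 0. On a kernel with this fix, *polled_status should still hold STATUS_SENTINEL afterwards, since wait4() leaves status alone when WNOHANG finds nothing to report.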


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it has been unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
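
From userland, the macro relationship described above can be observed with a sketch like the one below. stop_continue_demo() is a hypothetical helper written for this illustration; it only assumes the standard waitpid(2) interface, not anything specific to this commit.

```c
#define _DEFAULT_SOURCE 1	/* exposes WCONTINUED on glibc; assumed harmless elsewhere */
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: stop a child, continue it, then kill it, and
 * check what the wait status macros report at each step.
 * Returns 0 if every step looked as expected, -1 otherwise.
 */
static int
stop_continue_demo(void)
{
	pid_t pid;
	int st;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: idle until a signal ends us. */
		for (;;)
			pause();
	}

	/* Stop the child and collect the stop event. */
	if (kill(pid, SIGSTOP) == -1)
		goto fail;
	if (waitpid(pid, &st, WUNTRACED) != pid || !WIFSTOPPED(st))
		goto fail;

	/* Continue it; a continued status must not look stopped. */
	if (kill(pid, SIGCONT) == -1)
		goto fail;
	if (waitpid(pid, &st, WCONTINUED) != pid ||
	    !WIFCONTINUED(st) || WIFSTOPPED(st) || WIFSIGNALED(st))
		goto fail;

	/* Clean up: kill and reap the child. */
	if (kill(pid, SIGKILL) == -1)
		goto fail;
	if (waitpid(pid, &st, 0) != pid || !WIFSIGNALED(st))
		return -1;
	return WTERMSIG(st) == SIGKILL ? 0 : -1;

fail:
	kill(pid, SIGKILL);
	waitpid(pid, &st, 0);
	return -1;
}
```

The middle check is the interesting one: the status reported for a continued child satisfies WIFCONTINUED but neither WIFSTOPPED nor WIFSIGNALED.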


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
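
The clipping rule from the "nice bug" paragraph can be sketched as below. This is only an illustration, not the kernel code: the nice weight of two comes from the message, while PRIO_MAX = 20 and PPQ = 4 are assumed values, and estcpu_clip() is a hypothetical helper name.

```c
/* Assumed constants: only the nice weight of two is stated in the message. */
#define NICE_WEIGHT 2   /* priority points per nice level */
#define PRIO_MAX    20  /* largest nice value (assumption) */
#define PPQ         4   /* priorities per run queue (assumption) */

/* Upper bound that keeps a compute-bound process off the nice +20 queue. */
#define ESTCPU_LIMIT (NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: clip the decayed CPU estimate to the limit. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}
```

Under these assumed values the limit is 36, one queue width below the 40 points that nice +20 contributes on its own.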


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
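
Seen from userland, the effect of SA_NOCLDWAIT can be sketched as below. The helper names are hypothetical, and the example relies on the POSIX-documented behaviour that wait() fails with ECHILD once all children are gone without leaving zombies; it does not show the kernel-side reparenting to PID 1.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Ask the kernel not to keep zombies for this process's children.
 * Returns 0 on success, -1 on error (errno set by sigaction).
 */
static int
disable_child_zombies(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_NOCLDWAIT;
	return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Fork a child that exits at once, then wait for it.
 * With SA_NOCLDWAIT set, wait() blocks until the child is gone and
 * then fails with ECHILD because no zombie is left to report.
 * Returns 1 if that happened, 0 otherwise.
 */
static int
child_left_no_zombie(void)
{
	pid_t pid = fork();

	if (pid == -1)
		return 0;
	if (pid == 0)
		_exit(0);
	return wait(NULL) == -1 && errno == ECHILD;
}
```

Calling disable_child_zombies() once and then child_left_no_zombie() should report that no status was available for the parent.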


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.210 29-Dec-2022 guenther

Add ktrace struct tracepoints for siginfo_t to the kernel side of
waitid(2) and __thrsigdivert(2) and teach kdump(1) to handle them.
Also report more from the siginfo_t inside PSIG tracepoints.

ok mpi@


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CLD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@
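
A minimal userland sketch of collecting a child with waitid(2) follows. It uses only the POSIX interface (P_PID, WEXITED, CLD_EXITED); the helper name exit_code_via_waitid() is made up for illustration.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, then collect it with waitid(2).
 * Returns the exit code reported in siginfo_t, or -1 on any failure.
 */
static int
exit_code_via_waitid(int code)
{
	siginfo_t si;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(code);

	memset(&si, 0, sizeof(si));
	if (waitid(P_PID, (id_t)pid, &si, WEXITED) == -1)
		return -1;
	if (si.si_code != CLD_EXITED)
		return -1;
	return si.si_status;
}
```

Unlike wait4(2), the result arrives as separate si_code and si_status fields rather than one packed status word.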


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
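
The run-once guard described above can be modelled in userland as follows. This is a sketch under assumptions: the struct, the flag value, and the use of C11 atomics are illustrative stand-ins, and the kernel's actual locking is not shown in the message.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define PS_EXITING 0x01u  /* illustrative bit value, not the kernel's */

/* Simplified stand-in for the per-process state. */
struct process_model {
	atomic_uint ps_flags;
	int         xexit;    /* recorded exit code */
};

/*
 * Record the exit code only for the first caller.
 * Returns true if this call performed the "normal exit" work.
 */
static bool
exit_normal_once(struct process_model *pr, int code)
{
	unsigned int old = atomic_fetch_or(&pr->ps_flags, PS_EXITING);

	if (old & PS_EXITING)
		return false;   /* an earlier caller already ran the exit path */
	pr->xexit = code;
	return true;
}
```

A second call, for example from the last thread being torn down, sees the flag already set and leaves the recorded status untouched.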


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.
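
A userland analogue of folding the loop into one helper might look like this. It is not the kernel function: cancel_all_itimers_model() is a hypothetical name, and it goes through setitimer(2) rather than the in-kernel path.

```c
#include <stddef.h>
#include <sys/time.h>

/*
 * Userland analogue: disarm every interval timer of the calling process.
 * Returns 0 on success, -1 if any setitimer(2) call fails.
 */
static int
cancel_all_itimers_model(void)
{
	static const int which[] = { ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF };
	const struct itimerval zero = { { 0, 0 }, { 0, 0 } };
	size_t i;

	for (i = 0; i < sizeof(which) / sizeof(which[0]); i++) {
		/* A zero it_value disarms the timer. */
		if (setitimer(which[i], &zero, NULL) == -1)
			return -1;
	}
	return 0;
}
```

Callers then need a single call instead of repeating the loop at each site.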


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
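
Converting the two separate values back into a consolidated status could be sketched as below. The helper name make_wait_status() is hypothetical, and the bit layout is the traditional Unix wait status encoding, assumed here rather than taken from the commit.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: fold a separate exit code and terminating signal
 * into one wait(2)-style status word. The layout (exit code in bits 8-15,
 * signal in the low 7 bits) is the traditional encoding and an assumption.
 */
static int
make_wait_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

The standard WIFEXITED/WEXITSTATUS and WIFSIGNALED/WTERMSIG macros can then decode the result as usual.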


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
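
The WNOHANG behaviour described above can be checked from userland with a sketch like the following. The helper name and the sentinel value are illustrative, and the example only observes the status word, not the rusage contents.

```c
#define _DEFAULT_SOURCE   /* exposes wait4() on glibc; assumed harmless elsewhere */
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Start a child that blocks in pause(), poll it once with WNOHANG,
 * then kill and reap it.
 * Returns 1 if wait4() returned 0 and left status at its sentinel value.
 */
static int
wnohang_leaves_status_alone(void)
{
	const int sentinel = -12345;
	int status = sentinel;
	struct rusage ru;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return 0;
	if (pid == 0) {
		pause();
		_exit(0);
	}

	/* The child is still running, so nothing is waitable yet. */
	ret = wait4(pid, &status, WNOHANG, &ru);

	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	return ret == 0 && status == sentinel;
}
```

A return of 0 from wait4() means "no child ready", so a caller should not interpret the status word in that case.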


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
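
One way to picture a small cache of recently exited pids is a fixed-size ring buffer, as sketched below. Everything here is hypothetical (names, slot count, and the ring-buffer choice itself); the message does not say how the kernel implements the cache.

```c
#include <stdbool.h>
#include <stddef.h>

#define OLDPID_SLOTS 4  /* illustrative size */

/* Ring buffer holding the most recently exited pids. */
struct oldpid_cache {
	int    pids[OLDPID_SLOTS];
	size_t next;            /* slot to overwrite on the next exit */
};

static void
oldpid_init(struct oldpid_cache *c)
{
	size_t i;

	for (i = 0; i < OLDPID_SLOTS; i++)
		c->pids[i] = -1;    /* -1 marks an empty slot */
	c->next = 0;
}

/* Remember a pid that just exited, evicting the oldest entry. */
static void
oldpid_remember(struct oldpid_cache *c, int pid)
{
	c->pids[c->next] = pid;
	c->next = (c->next + 1) % OLDPID_SLOTS;
}

/* Return true if the pid exited recently and should not be reused yet. */
static bool
oldpid_is_recent(const struct oldpid_cache *c, int pid)
{
	size_t i;

	for (i = 0; i < OLDPID_SLOTS; i++)
		if (c->pids[i] == pid)
			return true;
	return false;
}
```

A pid allocator would consult oldpid_is_recent() before handing out a candidate and pick another pid if it matches.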


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
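
The "erase reachable pointers first, then release" ordering can be modelled as below. The types and functions are simplified userland stand-ins with made-up names; the real vrele() may sleep, which is exactly why the pointer is cleared before it is called.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for a vnode and its owner. */
struct vnode_model { int refcnt; };
struct owner_model { struct vnode_model *textvp; };

/* Drop one reference; free the object when the count reaches zero. */
static void
vrele_model(struct vnode_model *vp)
{
	if (--vp->refcnt == 0)
		free(vp);
}

/*
 * Unpublish the pointer before releasing it, so nothing can reach the
 * vnode through the owner while the release is in progress.
 */
static void
release_textvp(struct owner_model *o)
{
	struct vnode_model *vp = o->textvp;

	o->textvp = NULL;
	if (vp != NULL)
		vrele_model(vp);
}
```

After release_textvp() returns, the owner no longer references the vnode, regardless of whether other holders still keep it alive.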


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and dealing with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
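
For illustration only, a minimal userland sketch of the same idea, assuming
C11 <stdatomic.h> in place of the kernel's atomic_{set,clear}bits_int. The
flag names and helper functions are hypothetical and are not the kernel's
P_* flags or API.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, not the kernel's P_* values. */
#define FLAG_TRACED	0x01u
#define FLAG_EXITING	0x02u

/*
 * Atomically set bits. A plain "flags |= bits" is a read-modify-write
 * that can lose an update made concurrently from another context.
 */
static void
flag_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits. */
static void
flag_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}

/* Set two flags, clear one, and return what is left. */
static unsigned int
flag_demo(void)
{
	atomic_uint flags;

	atomic_init(&flags, 0);
	flag_setbits(&flags, FLAG_TRACED | FLAG_EXITING);
	flag_clearbits(&flags, FLAG_TRACED);
	return atomic_load(&flags);
}
```

Calling flag_demo() leaves only FLAG_EXITING set.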


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.209 19-Dec-2022 guenther

Add WTRAPPED option for waitid(2) to control whether CMD_TRAPPED
state changes are reported. That's the 6th bit, so switch to hex
constants. Adjust #if tests for consistency

ok kettenis@


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_t which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
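
A minimal sketch of what such a consolidation can look like, assuming the
traditional wait status layout (exit code in the high byte, terminating
signal in the low seven bits). The function names are hypothetical and
this is not the kernel's code.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: pack a separately stored exit code and signal
 * into one wait(2)-style status word. Assumes the traditional
 * (code << 8) | signal encoding.
 */
static int
consolidate_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}

/* Decode with the standard macros; -1 means "did not exit normally". */
static int
decoded_exit_code(int status)
{
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Decode with the standard macros; -1 means "not killed by a signal". */
static int
decoded_signal(int status)
{
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

For example, consolidate_status(3, 0) decodes as a normal exit with code 3,
while consolidate_status(0, SIGKILL) decodes as death by SIGKILL.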


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
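
The WNOHANG behaviour described above can be observed from userland. The following is a sketch only, using waitpid(2) as the portable stand-in for wait4(2); the function name and the sentinel value are hypothetical. It forks a child that blocks, polls it with WNOHANG, and reports whether the status word was left alone.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL 0x5a5a5a5a	/* arbitrary marker value */

/*
 * Hypothetical check: returns 1 if a WNOHANG wait on a still-running
 * child returned 0 and left `status' untouched, 0 if not, -1 on error.
 */
int
wnohang_leaves_status_untouched(void)
{
	int status = STATUS_SENTINEL;
	pid_t pid, r;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* child: block until the parent kills us */
		for (;;)
			pause();
	}

	/* the child is still alive, so WNOHANG returns immediately */
	r = waitpid(pid, &status, WNOHANG);

	/* clean up: terminate and reap the child */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	return r == 0 && status == STATUS_SENTINEL;
}
```

On a system that follows the behaviour this commit describes, the function is expected to return 1.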


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
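
One way to picture "a small cache of recently exited pids" is a fixed-size ring buffer that the pid allocator consults before handing out a number. This is an illustrative sketch, not the kernel code; struct pid_cache, RECENT_PIDS and the function names are all hypothetical.

```c
#include <assert.h>

#define RECENT_PIDS 4	/* hypothetical cache size */

struct pid_cache {
	int pids[RECENT_PIDS];	/* most recently exited pids */
	int next;		/* slot to overwrite next */
};

void
pid_cache_init(struct pid_cache *c)
{
	int i;

	for (i = 0; i < RECENT_PIDS; i++)
		c->pids[i] = -1;	/* -1 marks an empty slot */
	c->next = 0;
}

/* Record a pid that has just exited, evicting the oldest entry. */
void
pid_cache_add(struct pid_cache *c, int pid)
{
	c->pids[c->next] = pid;
	c->next = (c->next + 1) % RECENT_PIDS;
}

/* Return 1 if the pid exited recently and should not be reused yet. */
int
pid_cache_contains(const struct pid_cache *c, int pid)
{
	int i;

	for (i = 0; i < RECENT_PIDS; i++)
		if (c->pids[i] == pid)
			return 1;
	return 0;
}
```

An allocator built on this would skip any candidate pid for which pid_cache_contains() returns 1.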


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
in PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@
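
The reference-counting idea can be sketched in a few lines of C. This is an illustration under stated assumptions, not the kernel's struct process: the type and function names are hypothetical, and the counter is a plain int on the assumption that some outer lock serializes access.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for state shared by all threads of a process. */
struct procstate {
	int refcnt;	/* number of threads still referencing us */
};

/* Allocate the shared state with one reference for the first thread. */
struct procstate *
procstate_create(void)
{
	struct procstate *ps = malloc(sizeof(*ps));

	if (ps != NULL)
		ps->refcnt = 1;
	return ps;
}

/* A newly created thread takes its own reference. */
void
procstate_hold(struct procstate *ps)
{
	ps->refcnt++;
}

/*
 * Drop one reference. The state is freed only when the last thread
 * lets go, regardless of the order in which the threads are reaped.
 * Returns 1 if the state was freed, 0 otherwise.
 */
int
procstate_release(struct procstate *ps)
{
	assert(ps->refcnt > 0);
	if (--ps->refcnt == 0) {
		free(ps);
		return 1;
	}
	return 0;
}
```

With two threads holding references, the first procstate_release() returns 0 and only the second frees the structure, so whichever thread is reaped last never touches freed memory.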


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
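
For readers unfamiliar with the pattern, here is a userland approximation of atomic set/clear bit operations on a flag word. It uses C11 <stdatomic.h> as a stand-in; the kernel's atomic_setbits_int() and atomic_clearbits_int() are machine-dependent and are not implemented this way. The function names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically OR `bits' into the flag word. */
void
flag_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear `bits' in the flag word. */
void
flag_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

Unlike a plain `flags |= bits`, each call is a single indivisible read-modify-write, so an update made from interrupt context between the load and the store cannot be lost.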


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE, free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
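
A minimal sketch of the clipping rule from the "nice bug" paragraph, assuming the priority is the unscaled p_estcpu plus the nice weight times the nice value. The sketch_ names and SKETCH_ constants are illustrative, not taken from the kernel source; SKETCH_PPQ = 4 in particular is an assumed value, nice is taken as 0..20, and the base user priority is left out.

```c
#define SKETCH_PRIO_MAX     20	/* largest nice value */
#define SKETCH_NICE_WEIGHT  2	/* priority levels per nice step (from the text) */
#define SKETCH_PPQ          4	/* assumed: priority levels per run queue */
#define SKETCH_ESTCPU_LIMIT \
	(SKETCH_NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ)

/* Clip the decayed CPU estimate to NICE_WEIGHT * PRIO_MAX - PPQ. */
unsigned int
sketch_clamp_estcpu(unsigned int estcpu)
{
	return estcpu > SKETCH_ESTCPU_LIMIT ? SKETCH_ESTCPU_LIMIT : estcpu;
}

/* Priority offset above the base: larger means less favoured. */
unsigned int
sketch_priority(unsigned int estcpu, unsigned int nice)
{
	return sketch_clamp_estcpu(estcpu) + SKETCH_NICE_WEIGHT * nice;
}

/* Run queue index for a given priority offset. */
unsigned int
sketch_queue(unsigned int pri)
{
	return pri / SKETCH_PPQ;
}
```

With these assumed numbers a compute-bound nice +0 process tops out one queue below an idle nice +20 process, which is the property the clipping is meant to guarantee.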


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.208 05-Dec-2022 deraadt

zap a pile of dangling tabs


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_r which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
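
A sketch of the "consolidated status" conversion, assuming the traditional BSD wait status layout (exit code in bits 8-15, terminating signal in the low 7 bits). struct sketch_exit_info and the sketch_ functions are hypothetical stand-ins for the ps_xexit and ps_xsig fields, not the kernel's code.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical stand-in for the two per-process fields. */
struct sketch_exit_info {
	int xexit;	/* exit code passed to _exit(2) */
	int xsig;	/* terminating signal, or 0 for a normal exit */
};

/* Fold the two values into one wait(2)-style status word. */
int
sketch_wait_status(const struct sketch_exit_info *ei)
{
	return (ei->xexit & 0xff) << 8 | (ei->xsig & 0x7f);
}

/* Round-trip a normal exit through the standard decoding macros. */
int
sketch_decode_exit(int code)
{
	struct sketch_exit_info ei = { code, 0 };
	int status = sketch_wait_status(&ei);

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Round-trip a death by signal through the standard decoding macros. */
int
sketch_decode_signal(int sig)
{
	struct sketch_exit_info ei = { 0, sig };
	int status = sketch_wait_status(&ei);

	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Keeping the two values separate until a consumer needs the packed form means interfaces such as waitid(2), which report code and signal separately, do not have to unpack anything.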


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it is awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
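
The WNOHANG behaviour described above can be checked from userland with a sketch like this one, assuming a system that leaves the status word alone when nothing is reported. The function name and the sentinel value are hypothetical.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKETCH_STATUS_SENTINEL 0x5a5a	/* arbitrary marker value */

/*
 * Poll a still-running child with WNOHANG.
 * Returns 1 if waitpid() returned 0 and left the status word untouched,
 * 0 if either condition failed, or -1 if fork() failed.
 */
int
wnohang_leaves_status_alone(void)
{
	int status = SKETCH_STATUS_SENTINEL, ok;
	pid_t pid, r;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child stays alive until killed */
	}

	r = waitpid(pid, &status, WNOHANG);
	ok = (r == 0 && status == SKETCH_STATUS_SENTINEL);

	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);	/* reap the child */
	return ok;
}
```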


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
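
The ordering this entry describes (unpublish the pointer, then release the object) can be sketched in plain C. This is an illustration only: `struct vnode_like`, `struct proc_like`, `release_ref()` and `drop_textvp()` are hypothetical stand-ins, not the kernel's vnode API.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for a reference-counted vnode and its holder. */
struct vnode_like {
	int refcnt;
};

struct proc_like {
	struct vnode_like *textvp;
};

static int nreleased;	/* counts objects actually freed, for the demo */

/* Drop one reference; in the kernel this is the step that may sleep. */
static void
release_ref(struct vnode_like *vp)
{
	if (--vp->refcnt == 0) {
		free(vp);
		nreleased++;
	}
}

/* Clear the reachable pointer first, then release the saved copy. */
void
drop_textvp(struct proc_like *p)
{
	struct vnode_like *vp = p->textvp;

	p->textvp = NULL;	/* nobody can find the vnode through p now */
	if (vp != NULL)
		release_ref(vp);
}

/* Returns 1 if the pointer was cleared and the object freed exactly once. */
int
drop_textvp_demo(void)
{
	struct proc_like p;

	p.textvp = malloc(sizeof(*p.textvp));
	if (p.textvp == NULL)
		return -1;
	p.textvp->refcnt = 1;

	drop_textvp(&p);
	drop_textvp(&p);	/* second call is a harmless no-op */

	return (p.textvp == NULL && nreleased == 1);
}
```

Because the pointer is cleared before the release, a second caller that runs while the release is blocked sees NULL rather than a vnode that is about to go away.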


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
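
The idea of a per-process reference count can be sketched as follows. This is a simplified illustration under assumed names (`struct process_like`, `process_new()`, `process_rele()`); the real struct process and its locking are not shown. Each thread holds one reference, and the structure is freed only when the last one is dropped, whichever thread that turns out to be.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct process_like {
	int refcnt;	/* one reference per thread still attached */
};

static int nfreed;	/* counts frees, so the demo can observe them */

struct process_like *
process_new(int nthreads)
{
	struct process_like *pr = malloc(sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = nthreads;
	return pr;
}

/* Drop one thread's reference; free only when none remain. */
void
process_rele(struct process_like *pr)
{
	if (--pr->refcnt == 0) {
		free(pr);
		nfreed++;
	}
}

/* Returns 1 if the structure survived until the last thread let go. */
int
process_refcnt_demo(void)
{
	struct process_like *pr = process_new(2);
	int alive_after_first;

	if (pr == NULL)
		return -1;

	process_rele(pr);		/* main thread is reaped first */
	alive_after_first = (nfreed == 0);

	process_rele(pr);		/* the other thread follows */
	return (alive_after_first && nfreed == 1);
}
```

In this sketch the order in which the two threads are reaped does not matter, which is the property the commit is after.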


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
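
A userland analogue of atomic set/clear operations on a flag word can be written with C11 atomics. This is only a sketch: `flag_setbits()`, `flag_clearbits()` and the `DEMO_*` flag values are hypothetical, and the kernel's atomic_{set,clear}bits_int are MD primitives rather than wrappers around <stdatomic.h>.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_FLAG_A	0x01u
#define DEMO_FLAG_B	0x04u

/* Atomically OR bits into the flag word. */
static inline void
flag_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits from the flag word. */
static inline void
flag_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}

/* Sets two flags, clears one, and returns the resulting flag word. */
unsigned int
flag_demo(void)
{
	atomic_uint flags = 0;

	flag_setbits(&flags, DEMO_FLAG_A | DEMO_FLAG_B);
	flag_clearbits(&flags, DEMO_FLAG_A);

	return atomic_load(&flags);
}
```

Unlike a plain `flags |= bits`, each read-modify-write here is a single indivisible operation, so an interrupt or another CPU updating a different bit cannot cause an update to be lost.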


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
be incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
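
The clipping bound from the "nice bug" paragraph can be illustrated with a little arithmetic. The sketch below rests on several assumptions that the message does not spell out: PRIO_MAX is taken to be 20, PPQ (priorities per run queue) to be 4, and the priority offset is modelled as the unscaled estcpu plus NICE_WEIGHT times the nice value. All function names are hypothetical.

```c
/* NICE_WEIGHT is from the commit message; the other two are assumptions. */
#define NICE_WEIGHT	2
#define PRIO_MAX	20	/* assumed largest nice value */
#define PPQ		4	/* assumed priorities per run queue */

/* Upper bound on p_estcpu named in the commit message: 2 * 20 - 4 = 36. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the CPU-usage estimate to the bound. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return (estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT);
}

/* Assumed model of the user priority offset (larger means worse). */
unsigned int
prio_offset(unsigned int estcpu, unsigned int nice)
{
	return (clip_estcpu(estcpu) + NICE_WEIGHT * nice);
}

/* Run queue index for a given priority offset. */
unsigned int
queue_of(unsigned int prio)
{
	return (prio / PPQ);
}
```

Under these assumptions a fully CPU-bound nice +0 process tops out at offset 36 (queue 9), while an idle nice +20 process sits at offset 40 (queue 10), so the compute-bound process can never land on the nice +20 queue. Clipping at UCHAR_MAX instead would allow offsets up to 255, far past that queue.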


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.207 03-Nov-2022 guenther

Style: always use *retval and never retval[0] in syscalls,
to reflect that retval is just a single return value.

ok miod@
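
The style rule above can be illustrated with a small sketch. The handler name, its arguments, and the `retval_t` typedef are hypothetical stand-ins (the kernel uses its own machine-dependent register type); only the `*retval` convention comes from the commit message.

```c
#include <stddef.h>

/* Stand-in for the kernel's return-value type; an assumption of this sketch. */
typedef long retval_t;

/*
 * Hypothetical syscall handler written in the style the commit asks for:
 * retval points at a single return slot, so it is written with *retval.
 */
int
sys_example_getanswer(void *p, void *v, retval_t *retval)
{
	(void)p;	/* calling thread, unused in this sketch */
	(void)v;	/* syscall arguments, unused in this sketch */

	*retval = 42;	/* preferred over the equivalent retval[0] = 42 */
	return 0;	/* 0 means success; the value travels via *retval */
}
```

Both spellings compile to the same store; the dereference form simply makes it clear that there is one return value, not an array of them.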


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_r which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@
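
For readers unfamiliar with waitid(2), the following userland sketch shows the POSIX call as an application might use it. The function name `exit_code_via_waitid` is made up for illustration, and nothing here reflects the kernel-side implementation.

```c
/* Request the POSIX.1-2008 interfaces where the compiler is in strict mode. */
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits with `code`, then collect it with waitid(2).
 * Returns the child's exit code, or -1 on any failure.
 */
int
exit_code_via_waitid(int code)
{
	siginfo_t info;
	pid_t pid;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0)
		_exit(code);		/* child: terminate immediately */

	if (waitid(P_PID, (id_t)pid, &info, WEXITED) == -1)
		return -1;
	if (info.si_code != CLD_EXITED)	/* killed by a signal or dumped core */
		return -1;
	return info.si_status;		/* plain exit code, not a status word */
}
```

Unlike wait4(2), waitid(2) reports the result through a siginfo_t, so the caller reads `si_code` and `si_status` instead of decoding a packed status word.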


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
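
The "run once" guard described above can be sketched in userland C11. This is an illustration of the idea only: `struct fake_process`, `PS_EXITING_BIT`, and `record_exit_once` are hypothetical names, and the use of C11 atomics is an assumption of the sketch, not a description of how the kernel sets PS_EXITING.

```c
#include <stdatomic.h>

#define PS_EXITING_BIT	0x1u	/* hypothetical flag value for this sketch */

struct fake_process {
	atomic_uint	flags;		/* stand-in for the process flag word */
	int		exit_status;	/* recorded by the first exiter only */
};

/*
 * Record `status` only if no other thread has started exiting the
 * process. Returns 1 if this call won the race, 0 otherwise.
 */
int
record_exit_once(struct fake_process *pr, int status)
{
	unsigned int old;

	/* set the flag and learn, in one step, whether it was already set */
	old = atomic_fetch_or(&pr->flags, PS_EXITING_BIT);
	if (old & PS_EXITING_BIT)
		return 0;	/* someone already ran the normal exit path */
	pr->exit_status = status;
	return 1;
}
```

A later caller sees the flag already set and leaves the recorded status alone, which is the clobbering the commit is meant to prevent.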


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
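
As a rough illustration of what "a consolidated status" means, the sketch below folds a separate exit code and signal number into one wait(2)-style word. The helper `make_wait_status` is hypothetical, and the bit layout (exit code in bits 8-15, signal in the low 7 bits) is the traditional encoding, assumed here rather than taken from the commit.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: combine an exit code and a terminating signal
 * into one status word using the traditional encoding. Exactly one of
 * the two inputs is expected to be non-zero for a terminated process.
 */
int
make_wait_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

On systems that use this layout, the standard macros decode the word again: `WEXITSTATUS(make_wait_status(3, 0))` yields 3, and `WTERMSIG(make_wait_status(0, SIGKILL))` yields SIGKILL.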


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
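
The WNOHANG behaviour described above can be observed from userland with a sketch like the following. It uses waitpid(2) rather than wait4(2) to stay within POSIX; the function name and the sentinel value are made up for illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a5a5a	/* arbitrary marker value */

/*
 * Returns 1 if a WNOHANG poll of a still-running child returned 0 and
 * left the status word untouched, 0 if it did not, -1 on setup failure.
 */
int
wnohang_leaves_status_alone(void)
{
	int fds[2], status = STATUS_SENTINEL, ok;
	char c;
	pid_t pid, r;

	if (pipe(fds) == -1)
		return -1;
	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* child: block until the parent closes the write end */
		close(fds[1]);
		if (read(fds[0], &c, 1) == -1)
			_exit(1);
		_exit(0);
	}
	close(fds[0]);

	/* the child is still alive, so there is nothing to report yet */
	r = waitpid(pid, &status, WNOHANG);
	ok = (r == 0 && status == STATUS_SENTINEL);

	close(fds[1]);				/* EOF lets the child exit */
	if (waitpid(pid, &status, 0) != pid)	/* reap it */
		return -1;
	return ok;
}
```

The pipe keeps the child alive deterministically, so the first poll cannot race with the child's exit.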


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
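
One way to picture "a small cache of recently exited pids" is a fixed-size ring buffer that an allocator consults before handing out a pid. The sketch below is a guess at the general shape only; the names, the cache size, and the ring-buffer design are assumptions and are not taken from the kernel source.

```c
#include <stddef.h>

#define RECENT_PIDS	8	/* hypothetical cache size */

struct pid_cache {
	int	pids[RECENT_PIDS];	/* -1 marks an empty slot */
	size_t	next;			/* slot the next freed pid overwrites */
};

/* Mark every slot empty. */
void
pid_cache_init(struct pid_cache *pc)
{
	size_t i;

	for (i = 0; i < RECENT_PIDS; i++)
		pc->pids[i] = -1;
	pc->next = 0;
}

/* Remember a pid that has just exited, evicting the oldest entry. */
void
pid_cache_add(struct pid_cache *pc, int pid)
{
	pc->pids[pc->next] = pid;
	pc->next = (pc->next + 1) % RECENT_PIDS;
}

/* Returns 1 if pid exited recently and should not be reused yet. */
int
pid_cache_contains(const struct pid_cache *pc, int pid)
{
	size_t i;

	for (i = 0; i < RECENT_PIDS; i++)
		if (pc->pids[i] == pid)
			return 1;
	return 0;
}
```

An allocator built on this would skip any candidate for which `pid_cache_contains()` returns 1, so a pid becomes reusable only after `RECENT_PIDS` later exits have pushed it out.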


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
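The reference-counting idea in 1.74 can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions: the names sk_process, sk_process_new, sk_process_ref and sk_process_rele are hypothetical, and the sketch assumes the caller already serializes access (it takes no lock of its own), so it is not the kernel's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process state. */
struct sk_process {
	int refcnt;		/* one reference per thread still using it */
};

/* Allocate the shared state with a single reference for the first thread. */
static struct sk_process *
sk_process_new(void)
{
	struct sk_process *pr = calloc(1, sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = 1;
	return pr;
}

/* Take an extra reference, e.g. when another thread is created. */
static void
sk_process_ref(struct sk_process *pr)
{
	assert(pr->refcnt > 0);
	pr->refcnt++;
}

/*
 * Drop a reference. Only the last holder frees the structure, so a thread
 * that is reaped early cannot free state another thread still uses.
 * Returns 1 if the structure was freed, 0 otherwise.
 */
static int
sk_process_rele(struct sk_process *pr)
{
	assert(pr->refcnt > 0);
	if (--pr->refcnt > 0)
		return 0;
	free(pr);
	return 1;
}
```

With two holders, the first sk_process_rele() call returns 0 and leaves the structure alive; only the second call frees it.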


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
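As a rough userland analogue of the atomic flag updates described in 1.63, the sketch below uses C11 <stdatomic.h> in place of the kernel's atomic_setbits_int and atomic_clearbits_int. The helper names sk_setbits and sk_clearbits and the flag values are hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define SK_FLAG_A	0x01u
#define SK_FLAG_B	0x04u

/* Atomically OR bits into *flags, so a concurrent update is never lost. */
static inline void
sk_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits in *flags. */
static inline void
sk_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain `flags |= bit` is a read-modify-write sequence that an interrupt or another CPU can interleave with; the atomic forms avoid that without taking a lock.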


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok. No one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
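The encoding trick in 1.47 can be checked with a few lines of C. The SK_ macros below are hypothetical stand-ins whose values were picked to satisfy the properties stated in the commit message; they are an assumption for illustration and are not copied from <sys/wait.h>.

```c
/* Illustrative values only; chosen to satisfy the properties in the log. */
#define SK_WSTATUS(x)		((x) & 0177)
#define SK_WSTOPPED		0177		/* low 7 bits all set */
#define SK_WCONTINUED		0177777		/* low 16 bits all set */

/* "Stopped" compares the low 8 bits, so the continued value does not match. */
#define SK_WIFSTOPPED(x)	(((x) & 0xff) == SK_WSTOPPED)
#define SK_WIFCONTINUED(x)	(((x) & SK_WCONTINUED) == SK_WCONTINUED)
/* Needs no extra check: the continued value has the "stopped" low 7 bits. */
#define SK_WIFSIGNALED(x) \
	(SK_WSTATUS(x) != SK_WSTOPPED && SK_WSTATUS(x) != 0)

/* Classify a status word using the sketch macros above. */
static const char *
sk_classify(int status)
{
	if (SK_WIFCONTINUED(status))
		return "continued";
	if (SK_WIFSTOPPED(status))
		return "stopped";
	if (SK_WIFSIGNALED(status))
		return "signaled";
	return "exited";
}
```

Because the continued value shares its low seven bits with the stopped marker, the signaled test rejects it with no extra clause, while the eight-bit comparison in the stopped test keeps the two states apart.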


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.206 26-Oct-2022 kettenis

Fix handling of PGIDs in wait4(2) that I broke with the previous commit.

ok anton@, millert@


# 1.205 25-Oct-2022 kettenis

Implement waitid(2) which is now part of POSIX and used by mozilla.
This includes a change of siginfo_r which is technically an ABI break but
this should have no real-world impact since the members involved are
never touched by the kernel.

ok millert@, deraadt@


Revision tags: OPENBSD_7_2_BASE
# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
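The "run once" guard described in 1.191 can be illustrated with a small userland sketch. It assumes a C11 atomic fetch-or as a stand-in for whatever synchronization the kernel actually uses; sk_exit_state, SK_PS_EXITING and sk_exit_normal are hypothetical names, not the kernel's.

```c
#include <stdatomic.h>

#define SK_PS_EXITING	0x1u	/* hypothetical "exit already started" bit */

/* Hypothetical per-process exit bookkeeping. */
struct sk_exit_state {
	atomic_uint flags;
	int exit_code;
};

/*
 * Record the exit status only for the first caller. Later callers see the
 * flag already set and leave the stored status untouched.
 * Returns 1 if this call recorded the status, 0 otherwise.
 */
static int
sk_exit_normal(struct sk_exit_state *st, int code)
{
	unsigned int old = atomic_fetch_or(&st->flags, SK_PS_EXITING);

	if (old & SK_PS_EXITING)
		return 0;	/* already exiting: keep the first status */
	st->exit_code = code;
	return 1;
}
```

If one thread calls exit() with status 3 and the last thread to be torn down later reaches the same path with status 7, the stored status stays 3.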


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_set() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
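A minimal sketch of the "consolidated status" mentioned in 1.181, assuming the traditional wait(2) layout with the exit code in bits 8-15 and the terminating signal in the low 7 bits. sk_make_status is a hypothetical helper written for this illustration, and the core-dump bit is ignored.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: fold a separately stored exit code and signal into
 * one wait(2)-style status word (assumed layout: code << 8 | signal).
 */
static int
sk_make_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

Under that assumption, sk_make_status(3, 0) decodes as a normal exit with code 3 through the standard WIFEXITED and WEXITSTATUS macros, and sk_make_status(0, SIGKILL) decodes as a death by SIGKILL through WIFSIGNALED and WTERMSIG.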


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
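
The WNOHANG behaviour described above can be checked from userland. The
following is a minimal sketch, not part of the commit: the function name
and the sentinel value are made up for illustration, and it only
exercises the "status is not modified" half of the fix.

```c
#define _DEFAULT_SOURCE		/* glibc only: expose wait4(); harmless elsewhere */
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Poll a still-running child with WNOHANG.  Returns 1 if wait4()
 * returned 0 and left the caller's status word untouched, 0 if not,
 * and -1 if fork() failed.  (Hypothetical helper for illustration.)
 */
int
wnohang_leaves_status_alone(void)
{
	const int sentinel = 0x5a5a;
	int status = sentinel;
	struct rusage ru;
	pid_t child, ret;
	int ok;

	child = fork();
	if (child == -1)
		return -1;
	if (child == 0) {
		pause();		/* child: stay alive until signalled */
		_exit(0);
	}

	memset(&ru, 0, sizeof(ru));
	ret = wait4(child, &status, WNOHANG, &ru);
	ok = (ret == 0 && status == sentinel);

	kill(child, SIGKILL);		/* clean up the child */
	(void)waitpid(child, NULL, 0);
	return ok;
}
```

A caller that pre-loads status with a sentinel, as above, can tell "no
child changed state" apart from a real exit status without relying on
the return value alone.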


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
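
A cache of recently exited pids can be sketched as a small ring buffer.
This is only an illustration of the idea: the names, the cache size and
the replacement policy below are assumptions, not the code that was
committed.

```c
#include <sys/types.h>

#define DEMO_NOLDPIDS	4	/* hypothetical cache size */

static pid_t	demo_oldpids[DEMO_NOLDPIDS];
static unsigned	demo_oldpid_next;

/* Remember a pid that has just exited, overwriting the oldest entry. */
void
demo_pid_retire(pid_t pid)
{
	demo_oldpids[demo_oldpid_next++ % DEMO_NOLDPIDS] = pid;
}

/* Return 1 if pid is still in the cache and so must not be reused yet. */
int
demo_pid_recently_used(pid_t pid)
{
	int i;

	for (i = 0; i < DEMO_NOLDPIDS; i++)
		if (demo_oldpids[i] == pid)
			return 1;
	return 0;
}
```

A pid allocator built on this would skip any candidate for which
demo_pid_recently_used() returns 1. Note that the zero-initialised array
makes pid 0 look "recent", which is harmless here only because pid 0 is
never handed out.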


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
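
The "erase the pointer first, then release" ordering can be sketched as
follows. The structures and demo_vrele() are hypothetical stand-ins; the
real vrele() and vnode are far more involved, and the "observer" hook
only exists here to show what another thread could see at the point
where the real function might sleep.

```c
#include <stddef.h>

struct demo_vnode {
	int	usecount;
};

struct demo_owner {
	struct demo_vnode	*textvp;
};

static struct demo_owner	*demo_observer;	/* what another thread could see */
static int			 demo_saw_dangling;

/*
 * Stand-in for vrele(): drop one reference.  When the count hits zero,
 * which is where the real function may sleep, record whether the owner
 * still holds a pointer to the vnode being torn down.
 */
static void
demo_vrele(struct demo_vnode *vp)
{
	if (--vp->usecount == 0) {
		if (demo_observer != NULL && demo_observer->textvp == vp)
			demo_saw_dangling = 1;
	}
}

/* Clear the reachable pointer before releasing the vnode. */
void
demo_release_textvp(struct demo_owner *o)
{
	struct demo_vnode *vp = o->textvp;

	o->textvp = NULL;
	if (vp != NULL)
		demo_vrele(vp);
}

/* Returns 1 if no dangling pointer was observable during the release. */
int
demo_vrele_example(void)
{
	struct demo_vnode vn = { 1 };
	struct demo_owner o = { &vn };

	demo_observer = &o;
	demo_saw_dangling = 0;
	demo_release_textvp(&o);
	return demo_saw_dangling == 0 && o.textvp == NULL && vn.usecount == 0;
}
```

Swapping the two statements in demo_release_textvp() would make
demo_saw_dangling become 1, which is the window the commit closes.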


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@
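
The pid/tid relationship stated above can be sketched in a few lines.
The offset value and the helper names are placeholders chosen for the
example, and the comparison in demo_id_names_thread() is an assumption
about how a caller could tell the two apart; the log does not show how
sys_kill() does it.

```c
#include <sys/types.h>

/* Placeholder values; the kernel's real constants may differ. */
#define DEMO_PID_MAX		99999
#define DEMO_THREAD_PID_OFFSET	1000000

/* tid of a process's original thread, as described in the commit. */
pid_t
demo_main_thread_tid(pid_t pid)
{
	return pid + DEMO_THREAD_PID_OFFSET;
}

/*
 * Assumed test: since every pid is <= DEMO_PID_MAX, which is below the
 * offset, any id above the offset can only name a thread.
 */
int
demo_id_names_thread(pid_t id)
{
	return id > DEMO_THREAD_PID_OFFSET;
}
```

The scheme only works while the offset is larger than the largest valid
pid, which is why the sketch defines both constants side by side.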


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
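
The reference-count idea can be sketched as below. All names are
hypothetical and the real struct process carries far more state; the
sketch only shows that the structure survives until the last thread has
been reaped, regardless of which thread gets there first.

```c
#include <stdlib.h>

struct demo_process {
	int	refcnt;		/* one reference per thread still attached */
};

static int demo_frees;		/* how many times the struct was freed */

/* Create a process structure shared by nthreads threads. */
struct demo_process *
demo_process_new(int nthreads)
{
	struct demo_process *pr = malloc(sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = nthreads;
	return pr;
}

/* Called as each thread is reaped; frees on the last reference only. */
void
demo_process_rele(struct demo_process *pr)
{
	if (--pr->refcnt == 0) {
		free(pr);
		demo_frees++;
	}
}

/* Returns 1 if the struct outlived the first reap and was freed once. */
int
demo_refcnt_example(void)
{
	struct demo_process *pr = demo_process_new(2);
	int after_first;

	if (pr == NULL)
		return -1;
	demo_frees = 0;
	demo_process_rele(pr);		/* main thread reaches the reaper first */
	after_first = demo_frees;	/* must still be 0 */
	demo_process_rele(pr);		/* last remaining thread */
	return after_first == 0 && demo_frees == 1;
}
```

Without the count, the first demo_process_rele() call would have freed
the structure while the second thread still pointed at it.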


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
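
A TAILQ-based sleep queue can be walked once, unlinking each matching
sleeper in constant time. The sketch below uses the <sys/queue.h> TAILQ
macros with made-up structure and function names; it is an illustration
of the data structure, not the kernel's wakeup().

```c
#include <sys/queue.h>
#include <stddef.h>

struct demo_proc {
	const void		*wchan;	/* channel this sleeper waits on */
	TAILQ_ENTRY(demo_proc)	 link;
};
TAILQ_HEAD(demo_slpque, demo_proc);

/* Wake every sleeper on chan in a single pass; return how many. */
int
demo_wakeup(struct demo_slpque *q, const void *chan)
{
	struct demo_proc *p, *next;
	int n = 0;

	for (p = TAILQ_FIRST(q); p != NULL; p = next) {
		next = TAILQ_NEXT(p, link);	/* fetch before unlinking p */
		if (p->wchan == chan) {
			TAILQ_REMOVE(q, p, link);
			p->wchan = NULL;
			n++;
		}
	}
	return n;
}

/* Queue three sleepers, two on the same channel, and wake that channel. */
int
demo_wakeup_example(void)
{
	struct demo_slpque q = TAILQ_HEAD_INITIALIZER(q);
	struct demo_proc p[3];
	int chan_a, chan_b;	/* only their addresses are used */
	int i, woken;

	p[0].wchan = &chan_a;
	p[1].wchan = &chan_b;
	p[2].wchan = &chan_a;
	for (i = 0; i < 3; i++)
		TAILQ_INSERT_TAIL(&q, &p[i], link);

	woken = demo_wakeup(&q, &chan_a);

	/* Only the sleeper on chan_b should remain queued. */
	if (TAILQ_FIRST(&q) != &p[1] || TAILQ_NEXT(&p[1], link) != NULL)
		return -1;
	return woken;
}
```

The next pointer is saved before TAILQ_REMOVE() because the removed
element's links must not be followed afterwards.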


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
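
The mechanical change can be pictured with C11 atomics. The helpers
below are userland stand-ins for atomic_setbits_int() and
atomic_clearbits_int(); their names and the flag values are invented for
the example.

```c
#include <stdatomic.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define DEMO_FLAG_PROFIL	0x01u
#define DEMO_FLAG_TRACED	0x02u

/* Atomically OR bits into *p, so a concurrent update is never lost. */
static inline void
demo_setbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Atomically clear bits in *p. */
static inline void
demo_clearbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}

/* Set both flags, clear one, and return what is left. */
unsigned int
demo_flag_roundtrip(void)
{
	atomic_uint flag;

	atomic_init(&flag, 0);
	demo_setbits(&flag, DEMO_FLAG_PROFIL | DEMO_FLAG_TRACED);
	demo_clearbits(&flag, DEMO_FLAG_PROFIL);
	return atomic_load(&flag);
}
```

A plain "flag |= bits" is a separate load, modify and store, so an
interrupt that touches the same word between the load and the store can
have its update overwritten; the atomic forms avoid that.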


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
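
A rough sketch of the clipping rule from the "nice bug" paragraph, in C. All
names and constant values here are assumptions for illustration (the message
only gives the bound NICE_WEIGHT * PRIO_MAX - PPQ and a nice weight of two),
and the priority formula assumes the unscaled p_estcpu described in the
schedclock() paragraph.

```c
/* All constants below are assumed values, not taken from the kernel. */
#define SK_PUSER        50u  /* base user priority */
#define SK_PRIO_MAX     20u  /* largest nice value */
#define SK_NICE_WEIGHT  2u   /* priority levels per nice level */
#define SK_PPQ          4u   /* priorities per run queue */
#define SK_ESTCPU_LIMIT (SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Keep the CPU-usage estimate at or below the limit. */
static unsigned int sk_estcpu_clip(unsigned int estcpu)
{
	return estcpu > SK_ESTCPU_LIMIT ? SK_ESTCPU_LIMIT : estcpu;
}

/* User priority: larger means less favored. nice is 0..SK_PRIO_MAX here. */
static unsigned int sk_user_priority(unsigned int estcpu, unsigned int nice)
{
	return SK_PUSER + sk_estcpu_clip(estcpu) + SK_NICE_WEIGHT * nice;
}

/* Run queue index for a priority. */
static unsigned int sk_queue(unsigned int pri)
{
	return pri / SK_PPQ;
}
```

With these assumed numbers, a compute-bound nice +0 process tops out one queue
below an idle nice +20 process, which is the separation the message asks for.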


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.204 14-Aug-2022 jsg

remove unneeded includes in sys/kern
ok mpi@ miod@


Revision tags: OPENBSD_7_1_BASE
# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantics.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
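
The run-once guard can be illustrated in user-space C11. This is a sketch
under assumptions: the struct, the flag value, and the function name are
hypothetical, and the kernel uses its own flag handling rather than
<stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SK_PS_EXITING 0x01u   /* hypothetical flag bit */

struct sk_process {
	atomic_uint flags;
	int         exit_code;
};

/*
 * Record the exit status only for the first caller.
 * Returns true if this call performed the "normal exit" work.
 */
static bool sk_exit_normal_once(struct sk_process *pr, int code)
{
	unsigned int old = atomic_fetch_or(&pr->flags, SK_PS_EXITING);

	if (old & SK_PS_EXITING)
		return false;   /* the exit path already ran; keep its status */
	pr->exit_code = code;
	return true;
}
```

A second call, for example from the last thread being torn down, leaves the
status recorded by the first call untouched.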


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
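
One way to picture a small cache of recently exited pids is a fixed-size ring
buffer. The sketch below is an assumption about the general shape, not the
committed code: the names, the cache size, and the replacement policy are all
hypothetical.

```c
#include <stdbool.h>

#define SK_NRECENT 8   /* hypothetical cache size */

struct sk_pidcache {
	int          pids[SK_NRECENT];
	unsigned int next;   /* slot to overwrite next */
};

/* Mark every slot empty (-1 is never a valid pid). */
static void sk_pidcache_init(struct sk_pidcache *c)
{
	for (unsigned int i = 0; i < SK_NRECENT; i++)
		c->pids[i] = -1;
	c->next = 0;
}

/* Record a pid that just exited, evicting the oldest entry. */
static void sk_pidcache_remember(struct sk_pidcache *c, int pid)
{
	c->pids[c->next] = pid;
	c->next = (c->next + 1) % SK_NRECENT;
}

/* A pid allocator would skip any candidate for which this returns true. */
static bool sk_pidcache_is_recent(const struct sk_pidcache *c, int pid)
{
	for (unsigned int i = 0; i < SK_NRECENT; i++)
		if (c->pids[i] == pid)
			return true;
	return false;
}
```

A pid stays protected until SK_NRECENT later exits have pushed it out of the
buffer.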


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
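
The ordering this commit describes (unhook first, release second) can be sketched as follows. This is an illustration only, assuming hypothetical types `struct sk_vnode` and `struct sk_owner` and a stand-in `sk_vrele()`; the real vrele() and its callers look different.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct vnode or struct process. */
struct sk_vnode {
	int refcount;
};

struct sk_owner {
	struct sk_vnode *textvp;	/* pointer other code could follow */
};

/* Stand-in for vrele(): drops one reference. The real one may sleep. */
void
sk_vrele(struct sk_vnode *vp)
{
	vp->refcount--;
}

/*
 * Clear the reachable pointer before releasing, so that if the release
 * sleeps, nobody can find the vnode through the owner in the meantime.
 */
void
sk_release_text(struct sk_owner *o)
{
	struct sk_vnode *vp = o->textvp;

	if (vp == NULL)
		return;
	o->textvp = NULL;	/* erase the reachable pointer first */
	sk_vrele(vp);		/* only then drop the reference */
}
```

Calling `sk_release_text()` a second time is harmless in this sketch, because the pointer is already gone.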


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
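
The bit trick can be checked with a small sketch. The `SK_` macros below are local copies written for illustration; the octal values are assumed to mirror the ones in sys/wait.h and should be verified against the real header before relying on them. `sk_classify()` is a hypothetical helper.

```c
/* Local illustration macros; values are assumptions, see the note above. */
#define SK_WSTOPPED	0177		/* low 7 bits of a "stopped" status */
#define SK_WCONTINUED	0177777		/* status reported for a continued child */

#define SK_WSTATUS(x)		((x) & 0177)
#define SK_WIFSTOPPED(x)	(((x) & 0xff) == SK_WSTOPPED)
#define SK_WIFSIGNALED(x) \
	(SK_WSTATUS(x) != SK_WSTOPPED && SK_WSTATUS(x) != 0)
#define SK_WIFCONTINUED(x)	(((x) & SK_WCONTINUED) == SK_WCONTINUED)

/*
 * Classify a status word:
 * 'c' continued, 's' stopped, 'k' killed by a signal, 'e' exited.
 */
char
sk_classify(int status)
{
	if (SK_WIFCONTINUED(status))
		return 'c';
	if (SK_WIFSTOPPED(status))
		return 's';
	if (SK_WIFSIGNALED(status))
		return 'k';
	return 'e';
}
```

Because the low seven bits of the continued value equal the stopped marker, the signaled test rejects it without an extra check, while the stopped test compares eight bits and so rejects it as well.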


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
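
The clipping fix from the "nice bug" paragraph above can be sketched like this. The constants are assumptions chosen for illustration (a nice weight of two as stated in the message, PRIO_MAX of 20, and a PPQ of 4); the helper names are hypothetical and this is not the scheduler's actual code.

```c
#include <limits.h>

/* Assumed values for illustration only. */
#define SK_NICE_WEIGHT	2
#define SK_PRIO_MAX	20
#define SK_PPQ		4
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/*
 * Old behaviour: clip to UCHAR_MAX. An unsigned char can never exceed
 * UCHAR_MAX, so this returns its argument unchanged.
 */
unsigned char
sk_clip_old(unsigned char estcpu)
{
	unsigned int v = estcpu;

	return (v > UCHAR_MAX) ? UCHAR_MAX : (unsigned char)v;
}

/* New behaviour: keep the value at or below the limit named in the message. */
unsigned char
sk_clip_new(unsigned char estcpu)
{
	return (estcpu > SK_ESTCPU_LIMIT) ? SK_ESTCPU_LIMIT : estcpu;
}
```

With these assumed constants the limit works out to 36, so a compute-bound process can no longer accumulate enough p_estcpu to land on the same queue as a nice +20 process.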


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.203 31-Mar-2022 millert

Move knote_processexit() call from exit1() to the reaper().
This fixes a problem where NOTE_EXIT could be received before
the process was officially a zombie and thus not immediately
waitable. OK deraadt@ visa@


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
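
The guard described in this commit amounts to a run-once check on a process flag. A minimal sketch, assuming a hypothetical `struct sk_process` and flag value, and ignoring the locking or atomic operations the kernel would need:

```c
/* Hypothetical flag and structure; not the kernel's definitions. */
#define SK_PS_EXITING	0x1

struct sk_process {
	unsigned int flags;
	int xexit;		/* exit status reported to the parent */
};

/*
 * Record the exit status only on the first call.
 * Returns 1 if this call recorded it, 0 if the process was already exiting.
 */
int
sk_exit_normal(struct sk_process *pr, int status)
{
	if (pr->flags & SK_PS_EXITING)
		return 0;	/* already ran once; leave the status alone */
	pr->flags |= SK_PS_EXITING;
	pr->xexit = status;
	return 1;
}
```

If one thread calls exit(3) and the last remaining thread later reaches the same path, the second call in this sketch leaves the first status in place.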


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct members is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
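
The reference-counting idea in this commit can be sketched in user-space C.
This is only an illustration under stated assumptions: every name below
(sketch_process, sketch_process_hold, sketch_process_rele) is hypothetical
and not taken from the kernel source, and the plain int counter assumes the
callers are already serialized by some outer lock.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the state shared by all threads of a process. */
struct sketch_process {
    int refcnt;    /* one reference per thread still using the struct */
};

/* Allocate the shared struct with one reference held by the caller. */
struct sketch_process *
sketch_process_new(void)
{
    struct sketch_process *pr = calloc(1, sizeof(*pr));

    if (pr != NULL)
        pr->refcnt = 1;
    return pr;
}

/* Another thread joins the process and takes its own reference. */
void
sketch_process_hold(struct sketch_process *pr)
{
    pr->refcnt++;
}

/*
 * Drop one reference.  Returns 1 if this was the last reference and the
 * struct was freed, 0 if some other thread still holds it.
 */
int
sketch_process_rele(struct sketch_process *pr)
{
    if (--pr->refcnt > 0)
        return 0;
    free(pr);
    return 1;
}
```

With this shape it does not matter whether the main thread or a sibling is
torn down first: the shared struct is only freed by whichever thread drops
the last reference.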


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.
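
The "1/8" figure follows directly from the formula quoted above. The short
sketch below only reproduces that arithmetic; the function names are
hypothetical, and the real scheduler adds a base priority and clipping that
are left out here.

```c
/* Direct contribution of nice to the priority: the nice*2 term. */
int
sketch_direct_term(int nice)
{
    return nice * 2;
}

/*
 * Contribution of the same nice value if it is first added into
 * p_estcpu and then scaled by the p_estcpu/4 term.
 */
int
sketch_estcpu_term(int nice)
{
    return nice / 4;
}

/* The formula as quoted in the commit message. */
int
sketch_priority(int estcpu, int nice)
{
    return estcpu / 4 + nice * 2;
}
```

For nice 20 the direct term is 40 and the term routed through p_estcpu is 5,
so a single addition is one eighth of the direct effect; the message argues
that the real effect was larger because the addition was repeated and only
decayed exponentially.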

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.202 14-Feb-2022 claudio

Introduce a signal context that is used to pass signal related information
from cursig() to postsig() or the caller itself. This will simplify locking.
Also alter sigactsfree() a bit and move it into process_zap() so ps_sigacts
is always a valid pointer.
OK semarie@


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
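
The "run once" guard described here can be sketched with C11 atomics. This
is an assumption-laden illustration: the names (sketch_exit_state,
SKETCH_PS_EXITING, sketch_exit_normal) are hypothetical, and the kernel may
well rely on its own locking rather than an atomic fetch-or.

```c
#include <stdatomic.h>

#define SKETCH_PS_EXITING 0x1u    /* hypothetical flag bit */

/* Hypothetical per-process state touched during exit. */
struct sketch_exit_state {
    atomic_uint flags;
    int xexit;        /* recorded exit status */
    int exit_runs;    /* how often the normal-exit path actually ran */
};

/*
 * Record the exit status only for the first caller.  Returns 1 if this
 * call performed the normal-exit work, 0 if another thread already had.
 */
int
sketch_exit_normal(struct sketch_exit_state *pr, int code)
{
    unsigned int old = atomic_fetch_or(&pr->flags, SKETCH_PS_EXITING);

    if (old & SKETCH_PS_EXITING)
        return 0;    /* already exiting: keep the first status */
    pr->xexit = code;
    pr->exit_runs++;
    return 1;
}
```

A second call, for example from the last thread being torn down, leaves the
status recorded by the first call untouched.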


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@
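
A user-space analogue of such a helper is easy to sketch with the
setitimer(2) system call: a zeroed it_value disarms the timer, so callers
never build an itimerval themselves. The name sketch_cancel_itimer is
hypothetical and this is not the kernel's cancelitimer() implementation.

```c
#include <string.h>
#include <sys/time.h>

/*
 * Disarm one interval timer (ITIMER_REAL, ITIMER_VIRTUAL or ITIMER_PROF).
 * Returns 0 on success, -1 on error, exactly as setitimer(2) does.
 */
int
sketch_cancel_itimer(int which)
{
    struct itimerval itv;

    /* An all-zero it_value means "no timer pending". */
    memset(&itv, 0, sizeof(itv));
    return setitimer(which, &itv, NULL);
}
```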


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
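
A userland sketch of the behaviour this fix describes: poll a still-running
child with WNOHANG and check that the status word was left alone. The helper
name poll_running_child() and the sentinel value are illustrative assumptions,
not part of the commit.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that blocks forever, poll it once with WNOHANG, then
 * kill and reap it. Returns what wait4() returned for the poll; 0 is
 * expected because the child has not changed state. *status is only
 * handed to the non-blocking poll, so the caller can check afterwards
 * whether it was modified.
 */
pid_t
poll_running_child(int *status)
{
	pid_t child, polled;
	int reaped;

	child = fork();
	assert(child != -1);
	if (child == 0) {
		for (;;)
			pause();	/* never exits on its own */
	}

	/* Non-blocking poll: nothing to report yet. */
	polled = wait4(child, status, WNOHANG, NULL);

	/* Clean up with a separate status variable. */
	kill(child, SIGKILL);
	(void)waitpid(child, &reaped, 0);
	return polled;
}
```

A caller would preset the status word to a sentinel (for example 0x5a5a),
call poll_running_child(), and expect both a return value of 0 and an
unchanged sentinel.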


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
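
One way a small cache of recently exited pids could look, as a rough
sketch only. The names (oldpids, pid_remember, pid_recently_exited) and the
cache size are assumptions for illustration and are not taken from the
kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define OLDPIDS_MAX 4	/* hypothetical cache size */

/* Ring buffer of recently exited pids; 0 marks an empty slot. */
pid_t oldpids[OLDPIDS_MAX];
size_t oldpids_next;

/* Remember a pid that just exited, overwriting the oldest entry. */
void
pid_remember(pid_t pid)
{
	assert(pid > 0);
	oldpids[oldpids_next] = pid;
	oldpids_next = (oldpids_next + 1) % OLDPIDS_MAX;
}

/* True if pid exited recently and should not be handed out again yet. */
bool
pid_recently_exited(pid_t pid)
{
	for (size_t i = 0; i < OLDPIDS_MAX; i++)
		if (oldpids[i] == pid)
			return true;
	return false;
}
```

A pid allocator built on this would pick a candidate pid and retry while
pid_recently_exited() reports true for it.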


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
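
The ordering described here (unhook the pointer, then release) can be
sketched in plain C. struct node, struct owner and node_release() are
hypothetical stand-ins for the vnode, its holder and vrele(); this is not
the kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int refcnt;
};

struct owner {
	struct node *textnode;	/* hypothetical field standing in for a vnode pointer */
};

int released;	/* counts final releases, for demonstration only */

/* Stand-in for a release function that may sleep when the count hits zero. */
void
node_release(struct node *n)
{
	assert(n->refcnt > 0);
	if (--n->refcnt == 0)
		released++;
}

/* Clear the reachable pointer first, then drop the reference. */
void
owner_drop_node(struct owner *o)
{
	struct node *n = o->textnode;

	if (n == NULL)
		return;
	o->textnode = NULL;	/* nobody can find the node via the owner now */
	node_release(n);	/* may sleep; the owner no longer points at it */
}
```

Because the pointer is cleared before the release, anything that runs while
the release sleeps sees NULL rather than a node that is being torn down.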


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others
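
A simplified sketch of the GC-list scan described above, assuming a singly
linked list and a tid field that some other party (the kernel, in the
commit) sets to zero. The struct layout and the names gc_push() and
gc_scan() are hypothetical; real thread-library code would also need proper
locking and memory ordering.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

struct dead_thread {
	volatile pid_t tid;		/* zeroed once the thread is completely done */
	void *stack;			/* stack to free after that point */
	struct dead_thread *next;
};

/* Put an exiting thread's handle on the GC list. */
struct dead_thread *
gc_push(struct dead_thread **head, pid_t tid)
{
	struct dead_thread *t = malloc(sizeof(*t));

	assert(t != NULL);
	t->tid = tid;
	t->stack = malloc(64);	/* placeholder for a real thread stack */
	t->next = *head;
	*head = t;
	return t;
}

/* Free every entry whose tid has been zeroed; return how many were freed. */
int
gc_scan(struct dead_thread **head)
{
	int freed = 0;

	assert(head != NULL);
	while (*head != NULL) {
		struct dead_thread *t = *head;

		if (t->tid == 0) {
			*head = t->next;	/* unlink, then free stack and handle */
			free(t->stack);
			free(t);
			freed++;
		} else
			head = &t->next;	/* still running on its stack; keep it */
	}
	return freed;
}
```

In the scheme from the commit message, the scan would run from
pthread_create() and pthread_join(), so no kqueue or dedicated reaper
thread is needed in userland.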


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@
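
The pid/tid relationship can be illustrated with two small helpers. The
numeric value of THREAD_PID_OFFSET below and the "greater than the offset
means thread" test are assumptions made for this sketch; only the
"tid = pid + THREAD_PID_OFFSET" rule comes from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

#define THREAD_PID_OFFSET 1000000	/* assumed value, for illustration only */

/* Thread id of a process's original thread. */
pid_t
main_thread_tid(pid_t pid)
{
	assert(pid >= 0 && pid < THREAD_PID_OFFSET);
	return pid + THREAD_PID_OFFSET;
}

/*
 * Assumed check: an id above the offset names a single thread, anything
 * else names a whole process. This only works if real pids stay below
 * the offset.
 */
bool
id_names_thread(pid_t id)
{
	return id > THREAD_PID_OFFSET;
}
```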


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
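
A userland analogue of the set/clear-bits operations, written with C11
atomics rather than the kernel's atomic.h. The function names and flag
values here are hypothetical; the sketch only shows why a read-modify-write
on a shared flag word is done as one atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define P_EXAMPLE_A 0x01u	/* hypothetical flag bits */
#define P_EXAMPLE_B 0x02u

/* OR bits into the flag word in one atomic step. */
void
setbits(atomic_uint *flags, unsigned int bits)
{
	assert(flags != NULL);
	atomic_fetch_or(flags, bits);
}

/* Clear bits from the flag word in one atomic step. */
void
clearbits(atomic_uint *flags, unsigned int bits)
{
	assert(flags != NULL);
	atomic_fetch_and(flags, ~bits);
}
```

With a plain "flags |= bits", an interrupt that changes the word between
the load and the store would have its update overwritten; the atomic form
closes that window.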


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.
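
The "1/8" figure above follows from the formula quoted in the entry. A
minimal sketch of that arithmetic, where quoted_pri() is a hypothetical
name and any base priority offset is deliberately left out:

```c
#include <assert.h>

/* The formula as quoted in the entry: pri = p_estcpu/4 + nice*2. */
static int
quoted_pri(int estcpu, int nice)
{
	assert(estcpu >= 0);
	return estcpu / 4 + nice * 2;
}
```

With nice = 8, the direct term contributes 16, while folding the same 8
into p_estcpu contributes only 8/4 = 2, which is 1/8 of the direct effect
before any decay is taken into account.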

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.201 28-Jan-2022 guenther

When it's the possessive of 'it', it's spelled "its", without the
apostrophe.


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
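
The run-once guard described in this entry can be sketched as below. This
is an illustration only: it uses C11 atomics rather than whatever locking
the kernel relies on, and the struct, the flag value, and the function
name are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define PS_EXITING 0x0001u	/* hypothetical bit value */

struct process_sketch {
	atomic_uint ps_flags;
	int         normal_exit_runs;	/* how often the guarded work ran */
};

/* Returns 1 if this caller performed the normal-exit work, 0 otherwise. */
static int
exit_normal_once(struct process_sketch *pr)
{
	unsigned int old;

	assert(pr != NULL);
	/* Set the flag and learn, in one step, whether it was already set. */
	old = atomic_fetch_or(&pr->ps_flags, PS_EXITING);
	if (old & PS_EXITING)
		return 0;	/* some thread already ran the exit work */
	pr->normal_exit_runs++;
	return 1;
}
```

Whichever thread sets the flag first does the work; the last thread to be
torn down sees the flag and skips it, so the exit status is not clobbered.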


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
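
Converting the separate exit code and signal into one wait-style status
can be sketched as follows. The helper name is hypothetical, and the bit
layout is the traditional one assumed here (exit code in bits 8-15, signal
in the low 7 bits); the core-dump bit is ignored.

```c
#include <assert.h>
#include <sys/wait.h>

/* Hypothetical helper: pack exit code and signal into one status word. */
static int
consolidated_status(int xexit, int xsig)
{
	assert(xsig >= 0 && xsig < 0x7f);
	return ((xexit & 0xff) << 8) | xsig;
}
```

A status built this way for a normal exit with code 3 satisfies
WIFEXITED() and yields 3 from WEXITSTATUS(); one built for signal 9
satisfies WIFSIGNALED() and yields 9 from WTERMSIG() on systems that use
this layout.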


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
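
The WNOHANG behavior described above can be observed from userland with a
sketch like the one below. It uses waitpid(2) rather than wait4(2) for
portability, and the function name and sentinel value are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if waitpid(WNOHANG) returned 0 and left status untouched. */
static int
wnohang_leaves_status_alone(void)
{
	const int sentinel = 0x5a5a5a5a;
	int status = sentinel;
	pid_t pid, r;

	pid = fork();
	assert(pid != -1);
	if (pid == 0) {
		for (;;)
			pause();	/* child: stay alive until killed */
	}

	/* The child has not terminated, so nothing is ready to report. */
	r = waitpid(pid, &status, WNOHANG);

	/* Clean up: kill and reap the child. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	return r == 0 && status == sentinel;
}
```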


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
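
A small cache of recently exited pids, as described in this entry, could
look like the ring buffer below. The cache size, variable names, and
function names are assumptions for illustration, not the committed code.

```c
#include <assert.h>
#include <stddef.h>

#define OLDPIDS_N 8	/* assumed cache size */

static int    oldpids[OLDPIDS_N];	/* zero means "empty slot" */
static size_t oldpids_next;		/* next slot to overwrite */

/* Record a pid that just exited, evicting the oldest entry. */
static void
remember_exited_pid(int pid)
{
	assert(pid > 0);
	oldpids[oldpids_next] = pid;
	oldpids_next = (oldpids_next + 1) % OLDPIDS_N;
}

/* Returns 1 if pid is still in the cache and should not be reused yet. */
static int
pid_recently_exited(int pid)
{
	for (size_t i = 0; i < OLDPIDS_N; i++)
		if (oldpids[i] == pid)
			return 1;
	return 0;
}
```

A pid allocator would skip any candidate for which pid_recently_exited()
returns 1, so a pid becomes reusable only after OLDPIDS_N later exits.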


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
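
The ordering described here (erase reachable pointers first, then release)
can be sketched as follows. All types and names are hypothetical stand-ins;
release_may_sleep() merely plays the role of a release function that could
sleep.

```c
#include <assert.h>
#include <stddef.h>

struct vnode_sketch { int refcnt; };
struct owner_sketch { struct vnode_sketch *textvp; };

/* Stand-in for a release routine that may sleep when the count drops. */
static void
release_may_sleep(struct vnode_sketch *vp)
{
	vp->refcnt--;
}

static void
drop_text_vnode(struct owner_sketch *o)
{
	struct vnode_sketch *vp;

	assert(o != NULL);
	vp = o->textvp;
	o->textvp = NULL;	/* unpublish before anything can sleep */
	if (vp != NULL)
		release_may_sleep(vp);
}
```

Because the pointer is cleared before the release, nothing that runs while
the release sleeps can find the vnode through the owner again.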


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@
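
The tid convention described above can be sketched in userland C as
follows. This is only an illustration: the SKETCH_ names and the offset
value are assumptions made for the example, not the kernel's definitions,
and the sketch assumes every pid is smaller than the offset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical offset; the real THREAD_PID_OFFSET is defined by the kernel. */
#define SKETCH_THREAD_PID_OFFSET 1000000

/* tid of a process's original thread: its pid plus the offset. */
int
sketch_main_tid(int pid)
{
	return pid + SKETCH_THREAD_PID_OFFSET;
}

/* True if id lies in the tid range, i.e. names a thread, not a process. */
bool
sketch_is_tid(int id)
{
	return id > SKETCH_THREAD_PID_OFFSET;
}
```

With such a split, a kill-style interface can tell from the numeric id
alone whether the caller meant the whole process or one thread.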


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
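
A minimal userland sketch of the reference-count idea, assuming one
reference per thread that still uses the shared structure. The sketch_
names are hypothetical and no locking is shown; the kernel code relies on
its own locking rules.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the state shared by all threads of a process. */
struct sketch_process {
	int refcnt;	/* number of threads still holding a reference */
};

/* Allocate the shared state with one reference for the main thread. */
struct sketch_process *
sketch_process_new(void)
{
	struct sketch_process *pr = calloc(1, sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = 1;
	return pr;
}

/* Take a reference on behalf of an additional thread. */
void
sketch_process_ref(struct sketch_process *pr)
{
	pr->refcnt++;
}

/*
 * Drop one reference.  The structure is freed only when the last
 * reference goes away; returns 1 in that case, 0 otherwise.
 */
int
sketch_process_rele(struct sketch_process *pr)
{
	if (--pr->refcnt > 0)
		return 0;
	free(pr);
	return 1;
}
```

Whichever thread drops the last reference frees the structure, so the
order in which threads reach the reaper no longer matters.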


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
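
The kind of mechanical change described above can be illustrated in
userland with C11 <stdatomic.h> standing in for the kernel's
atomic_{set,clear}bits_int. The sketch_ helpers and flag values are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define SKETCH_FLAG_A	0x01u
#define SKETCH_FLAG_B	0x02u

/* Atomically set bits; replaces a plain "flags |= bits". */
void
sketch_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits; replaces a plain "flags &= ~bits". */
void
sketch_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain read-modify-write can lose an update when an interrupt handler
modifies the same word between the load and the store; the atomic forms
avoid that.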


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
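
The trick described above can be checked with a small self-contained
sketch. The SK_ macros and their octal values are assumptions modeled on
the description in this message; consult <sys/wait.h> for the
authoritative definitions.

```c
#include <assert.h>

/* Hypothetical copies of the wait-status macros, for illustration only. */
#define SK_WSTATUS(x)		((x) & 0177)
#define SK_WSTOPPED		0177		/* low bits of a "stopped" status */
#define SK_WCONTINUED		0177777		/* status of a continued process */

/* Compare the full low byte, so SK_WCONTINUED is not seen as stopped. */
#define SK_WIFSTOPPED(x)	(((x) & 0xff) == SK_WSTOPPED)
#define SK_WIFSIGNALED(x) \
	(SK_WSTATUS(x) != SK_WSTOPPED && SK_WSTATUS(x) != 0)
#define SK_WIFCONTINUED(x)	(((x) & SK_WCONTINUED) == SK_WCONTINUED)

/* Build the status of a process stopped by signal sig. */
int
sk_stopped_status(int sig)
{
	return (sig << 8) | SK_WSTOPPED;
}
```

Because the low seven bits of the continued value equal the stopped
marker, the signaled test rejects it without an extra check, while the
stopped test looks at eight bits and rejects it as well.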


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
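
The clipping rule from the "nice bug" paragraph can be sketched as below.
All SK_ constants are assumptions chosen for illustration (a nice weight
of two as stated in the message, a maximum nice of 20, and four
priorities per queue), and the priority function is a simplified
stand-in that omits the base user priority.

```c
#include <assert.h>

/* Assumed constants, for illustration only. */
#define SK_NICE_WEIGHT	2	/* priority units per nice level */
#define SK_PRIO_MAX	20	/* largest nice value */
#define SK_PPQ		4	/* priorities per run queue */

/* Upper bound for the CPU estimate: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Clip the CPU usage estimate to the limit. */
unsigned int
sk_clip_estcpu(unsigned int estcpu)
{
	return estcpu > SK_ESTCPU_LIMIT ? SK_ESTCPU_LIMIT : estcpu;
}

/* Simplified priority: larger numbers mean lower scheduling priority. */
unsigned int
sk_user_priority(unsigned int estcpu, unsigned int nice)
{
	return estcpu + SK_NICE_WEIGHT * nice;
}
```

Under these assumptions, a compute-bound process at nice +0 can never
penalize itself down to the level where a nice +20 process starts.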


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.200 24-Oct-2021 jsg

use NULL not 0 for pointer values in kern
ok semarie@


Revision tags: OPENBSD_6_9_BASE OPENBSD_7_0_BASE
# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
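
A run-once guard of the kind described above can be sketched in userland
C. The sk_ names and the flag value are hypothetical, and C11 atomics
stand in for whatever locking the kernel code actually relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_PS_EXITING	0x1u	/* hypothetical "exit in progress" flag */

/* Hypothetical per-process exit state. */
struct sk_exit_state {
	atomic_uint flags;
	int xexit;		/* recorded exit status */
};

/*
 * Record the exit status only for the first caller.  Later callers see
 * the flag already set and leave the stored status untouched.
 * Returns true if this call recorded the status.
 */
bool
sk_exit_normal(struct sk_exit_state *pr, int code)
{
	unsigned int old = atomic_fetch_or(&pr->flags, SK_PS_EXITING);

	if (old & SK_PS_EXITING)
		return false;
	pr->xexit = code;
	return true;
}
```

The second caller, for example the last thread being torn down, cannot
clobber the status stored by the thread that called exit() first.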


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_set() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
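
Keeping the exit code and the signal apart and merging them on demand
can be sketched as follows. The sk_ names are hypothetical; the packing
(exit code in the second byte, signal in the low bits) is the
conventional wait-status layout and is assumed here rather than taken
from this commit.

```c
#include <assert.h>

/* Hypothetical packing macro in the style of W_EXITCODE(). */
#define SK_W_EXITCODE(ret, sig)	((ret) << 8 | (sig))

/* Hypothetical stand-in for the two separate fields. */
struct sk_exit_info {
	int xexit;	/* exit code passed to exit */
	int xsig;	/* terminating signal, 0 if none */
};

/* Build the consolidated status only when a consumer asks for it. */
int
sk_wait_status(const struct sk_exit_info *ei)
{
	return SK_W_EXITCODE(ei->xexit & 0xff, ei->xsig);
}
```

Storing the two values separately means an interface that wants them
unmerged does not have to unpack a combined status word first.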


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
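
The WNOHANG behaviour described in 1.143 can be observed from userland. The
following is a minimal sketch, not code from the tree: wnohang_demo() and
STATUS_SENTINEL are names invented for this illustration. It polls a child
that is still running and checks that wait4() returns 0 and leaves the
caller's status word alone.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose wait4(); harmless elsewhere */

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL 0x5a5a	/* arbitrary marker chosen for this sketch */

/*
 * Fork a child that blocks, poll it with WNOHANG while it is still
 * running, then kill and reap it.  Returns 0 when every step behaved
 * as described, or a small positive step number on the first mismatch.
 */
int
wnohang_demo(void)
{
	struct rusage ru;
	int status = STATUS_SENTINEL;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return 1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: sleep until killed */
	}
	assert(pid > 0);

	memset(&ru, 0, sizeof(ru));
	ret = wait4(pid, &status, WNOHANG, &ru);
	if (ret != 0)
		return 2;	/* child is alive, nothing to report */
	if (status != STATUS_SENTINEL)
		return 3;	/* status must not have been written */

	if (kill(pid, SIGKILL) == -1)
		return 4;
	ret = wait4(pid, &status, 0, &ru);	/* blocking reap */
	if (ret != pid)
		return 5;
	if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGKILL)
		return 6;
	return 0;
}
```

A caller would treat a return of 0 from wnohang_demo() as "behaves as the
commit message describes"; any other value names the step that differed.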


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
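
The ordering described in 1.121 (clear every reachable pointer first, then
release) can be sketched in userland C. This is an illustration under stated
assumptions, not the committed diff: struct demo_vnode, struct demo_process,
demo_vrele() and demo_release_text() are hypothetical stand-ins, and the
choice of a "text vnode" pointer is only an example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a vnode and its owner. */
struct demo_vnode {
	int refcnt;
};

struct demo_process {
	struct demo_vnode *textvp;
};

/*
 * Stand-in for vrele(): drop one reference and free at zero.
 * The real function may sleep when the count hits zero, which is
 * why nobody else should still be able to find the pointer.
 */
static void
demo_vrele(struct demo_vnode *vp)
{
	assert(vp->refcnt > 0);
	if (--vp->refcnt == 0)
		free(vp);
}

/* Unpublish the pointer first, then release the saved copy. */
void
demo_release_text(struct demo_process *pr)
{
	struct demo_vnode *vp = pr->textvp;

	pr->textvp = NULL;
	if (vp != NULL)
		demo_vrele(vp);
}

/* Returns 0 if the owner no longer points at the released vnode. */
int
release_demo(void)
{
	struct demo_process pr;

	pr.textvp = malloc(sizeof(*pr.textvp));
	if (pr.textvp == NULL)
		return -1;
	pr.textvp->refcnt = 1;
	demo_release_text(&pr);
	return pr.textvp == NULL ? 0 : 1;
}
```

The point of the ordering is that a release routine which can sleep never
runs while the owner still exposes the pointer being released.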


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
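
The idea in 1.63 (replace plain read-modify-write on a shared flag word with
atomic bit operations) can be sketched as a userland analogue. This is an
illustration built on C11 <stdatomic.h>, not the kernel's atomic.h
interface; demo_setbits(), demo_clearbits() and the DEMO_FLAG_* values are
hypothetical names made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits for this sketch; not the kernel's P_* values. */
#define DEMO_FLAG_A	0x01u
#define DEMO_FLAG_B	0x02u

struct demo_proc {
	atomic_uint flags;
};

/* Atomically OR bits into the flag word. */
static inline void
demo_setbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Atomically clear bits from the flag word. */
static inline void
demo_clearbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}

/* Set two flags, clear one, and return what is left. */
unsigned int
flags_demo(void)
{
	struct demo_proc p;

	atomic_init(&p.flags, 0);
	demo_setbits(&p.flags, DEMO_FLAG_A | DEMO_FLAG_B);
	demo_clearbits(&p.flags, DEMO_FLAG_A);
	return atomic_load(&p.flags);
}
```

Compared with `flags |= bit`, each helper is a single indivisible update, so
a concurrent setter (for example an interrupt handler) cannot have its change
lost between the load and the store.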


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
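
The clipping rule described above could be sketched as follows. The constant values are assumptions taken from the classic BSD definitions (NICE_WEIGHT of two as stated above, PRIO_MAX of 20, PPQ of 4); ESTCPU_LIMIT and estcpu_clip() are illustrative names, not the kernel's own.

```c
#include <assert.h>
#include <limits.h>

/* Assumed values; check the headers of the tree in question. */
#define NICE_WEIGHT	2	/* priority points per nice level */
#define PRIO_MAX	20	/* largest nice value */
#define PPQ		4	/* priorities per run queue (128 / 32) */

#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* The limit has to be a real bound, unlike clipping to UCHAR_MAX. */
static_assert(ESTCPU_LIMIT < UCHAR_MAX, "limit must be below UCHAR_MAX");

/* Clip a CPU usage estimate so it cannot outweigh a nice +20 process. */
unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT;
}
```

With these assumed values the limit works out to 36, so a compute-bound nice +0 process stays one queue below a nice +20 process.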

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.
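
The .75 figure can be reproduced with a small simulation. This sketch assumes the once-per-second decay filter of the classic BSD scheduler, estcpu = estcpu * 2*load/(2*load + 1) + nice, which is not spelled out in this message; estcpu_after() and extra_priority() are illustrative names.

```c
#include <assert.h>

/*
 * Run the assumed once-per-second filter for a number of seconds:
 *   estcpu = estcpu * (2*load) / (2*load + 1) + nice
 */
double
estcpu_after(int seconds, double load, double nice)
{
	double decay = (2.0 * load) / (2.0 * load + 1.0);
	double estcpu = 0.0;
	int i;

	assert(seconds >= 0 && load >= 0.0);
	for (i = 0; i < seconds; i++)
		estcpu = estcpu * decay + nice;
	return estcpu;
}

/* Extra priority picked up through the p_estcpu/4 term of the old formula. */
double
extra_priority(int seconds, double load, double nice)
{
	return estcpu_after(seconds, load, nice) / 4.0;
}
```

At a load average of one the decay factor is 2/3, so the estimate settles at 3 * nice and the extra priority at 0.75 * nice, which is the convergence described above.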

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
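
The user-visible effect of SA_NOCLDWAIT can be checked with a sketch like this one. It only observes behaviour from userland and says nothing about how the kernel reparents the child; nocldwait_discards_zombie() is an illustrative name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if, with SA_NOCLDWAIT set for SIGCHLD, waiting for an exited
 * child fails with ECHILD (no zombie was kept for us), 0 otherwise.
 */
int
nocldwait_discards_zombie(void)
{
	struct sigaction sa, old;
	int status, result;
	pid_t pid;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return 0;

	pid = fork();
	assert(pid != -1);
	if (pid == 0)
		_exit(0);

	/* Blocks until the child is gone, then reports nothing is left. */
	result = (waitpid(pid, &status, 0) == -1 && errno == ECHILD);

	sigaction(SIGCHLD, &old, NULL);	/* restore the previous disposition */
	return result;
}
```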


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.199 12-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
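
The run-once guard can be illustrated in userland with C11 atomics. This is only a sketch of the idea: process_sketch, PS_EXITING_SKETCH and exit_normal_once() are hypothetical stand-ins, and the kernel uses its own flag handling rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define PS_EXITING_SKETCH	0x1u	/* hypothetical stand-in for PS_EXITING */

struct process_sketch {
	atomic_uint flags;	/* process-wide flags */
	int xexit;		/* exit status reported to the parent */
};

/*
 * Record the exit status only for the first caller. Returns 1 if this
 * call recorded the status, 0 if the process was already exiting.
 */
int
exit_normal_once(struct process_sketch *pr, int status)
{
	unsigned int old;

	assert(pr != NULL);
	old = atomic_fetch_or(&pr->flags, PS_EXITING_SKETCH);
	if (old & PS_EXITING_SKETCH)
		return 0;	/* someone else got here first */
	pr->xexit = status;
	return 1;
}
```

A later caller, such as the last thread being torn down, leaves the already recorded status alone.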


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
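
From the caller's side, the WNOHANG case looks like the sketch below: a return value of 0 means no child has changed state, and status must not be inspected. poll_running_child() is an illustrative name and the code assumes a POSIX system.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Poll a child that is still running. Returns what waitpid(WNOHANG)
 * returned, which should be 0 because the child has not changed state.
 */
int
poll_running_child(void)
{
	int status;
	pid_t pid, r;

	pid = fork();
	assert(pid != -1);
	if (pid == 0) {
		for (;;)
			pause();	/* child idles until killed */
	}

	/* Non-blocking poll; on a 0 return status carries no information. */
	r = waitpid(pid, &status, WNOHANG);

	kill(pid, SIGKILL);		/* clean up the child */
	waitpid(pid, &status, 0);
	return (int)r;
}
```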


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
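
One way to picture such a cache is a small ring buffer that the allocator consults before handing out a pid. This is a hypothetical sketch, not the code from this revision; recent_pids, its size and the helper names are all made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define RECENT_PIDS	8	/* hypothetical cache size */

struct recent_pids {
	pid_t slot[RECENT_PIDS];	/* -1 marks an empty slot */
	unsigned int next;		/* index of the slot to overwrite */
};

void
recent_pids_init(struct recent_pids *rp)
{
	size_t i;

	assert(rp != NULL);
	for (i = 0; i < RECENT_PIDS; i++)
		rp->slot[i] = -1;
	rp->next = 0;
}

/* Remember an exited pid, overwriting the oldest entry. */
void
recent_pids_add(struct recent_pids *rp, pid_t pid)
{
	rp->slot[rp->next] = pid;
	rp->next = (rp->next + 1) % RECENT_PIDS;
}

/* A pid allocator would skip any candidate for which this returns true. */
bool
recent_pids_contains(const struct recent_pids *rp, pid_t pid)
{
	size_t i;

	for (i = 0; i < RECENT_PIDS; i++)
		if (rp->slot[i] == pid)
			return true;
	return false;
}
```

A pid stays unavailable only until RECENT_PIDS further exits have pushed it out of the ring.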


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
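
The ordering described here, unhook the pointer and only then release, can be sketched as follows. The structures and vnode_release_sketch() are hypothetical stand-ins; the real vrele() may sleep, which is why the pointer must no longer be reachable by then.

```c
#include <assert.h>
#include <stddef.h>

struct vnode_sketch {
	int refcnt;
};

struct holder_sketch {
	struct vnode_sketch *vp;	/* pointer other code may follow */
};

/* Stand-in for vrele(); the real one may sleep when the count hits zero. */
static void
vnode_release_sketch(struct vnode_sketch *vp)
{
	assert(vp->refcnt > 0);
	vp->refcnt--;
}

/*
 * Clear the reachable pointer first, then drop the reference, so nobody
 * can find the vnode through the holder while the release is in progress.
 */
void
holder_drop_vnode(struct holder_sketch *h)
{
	struct vnode_sketch *vp = h->vp;

	h->vp = NULL;
	if (vp != NULL)
		vnode_release_sketch(vp);
}
```

Calling holder_drop_vnode() a second time is harmless because the pointer is already NULL.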


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
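
The value trick described above can be checked with a small C sketch. The
X_ names and the octal values below are assumptions chosen to satisfy the
properties stated in the commit message; they are not copied from
<sys/wait.h>.

```c
/* Hypothetical stand-ins for the wait-status macros (values assumed). */
#define X_WSTATUS(x)		((x) & 0177)
#define X_WSTOPPED		0177	/* low 7 bits when stopped */
#define X_WCONTINUED		0177777	/* marker for "continued" */

/* Stopped test looks at 8 bits, so the continued marker fails it. */
#define X_WIFSTOPPED(x)		(((x) & 0xff) == X_WSTOPPED)
#define X_WIFCONTINUED(x)	(((x) & X_WCONTINUED) == X_WCONTINUED)

/* These need no extra check: the low 7 bits of the marker are 0177. */
#define X_WIFSIGNALED(x) \
	(X_WSTATUS(x) != X_WSTOPPED && X_WSTATUS(x) != 0)
#define X_WIFEXITED(x)		(X_WSTATUS(x) == 0)
```

With these assumed values, a continued status has the same low seven bits
as a stopped one, so the signaled and exited tests reject it without
modification, while the eight-bit comparison in the stopped test tells the
two apart.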


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.198 08-Mar-2021 claudio

Revert commitid: AZrsCSWEYDm7XWuv;

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

This diff did not properly kill SINGLE_PTRACE and broke RAMDISK kernels.


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
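
The "run once, guarded by a flag" idea in this commit can be sketched in
userland C. This is not the kernel code: it swaps in C11 atomics for the
kernel's own primitives, and struct xprocess, X_PS_EXITING and
xprocess_begin_exit() are hypothetical names used only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define X_PS_EXITING	0x01u	/* hypothetical "exit started" bit */

/* Hypothetical process state shared by all threads. */
struct xprocess {
	atomic_uint ps_flags;
	int ps_xexit;		/* exit status recorded once */
};

/*
 * Set the exiting flag atomically. Only the caller that flips
 * the bit from clear to set records the exit status; later
 * callers see the bit already set and leave the status alone.
 */
static bool
xprocess_begin_exit(struct xprocess *pr, int code)
{
	unsigned int old;

	old = atomic_fetch_or(&pr->ps_flags, X_PS_EXITING);
	if (old & X_PS_EXITING)
		return false;
	pr->ps_xexit = code;
	return true;
}
```

In this sketch the first caller wins and its status is kept, so a later
thread teardown cannot clobber it.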


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_set() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
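
As an illustration of the consolidated status mentioned above, the sketch
below packs a separate exit code and signal number into one wait(2)-style
status word. The helper name consolidate_status() is hypothetical, and the
layout (exit code in bits 8-15, signal in bits 0-6) is the traditional
encoding, assumed here rather than taken from the commit.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper, not the kernel's code: build a wait(2)-style
 * status from a separate exit code and terminating signal, assuming
 * the traditional layout (exit code in bits 8-15, signal in bits 0-6).
 */
static int
consolidate_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

A caller can decode the result with the usual WIFEXITED(), WEXITSTATUS(),
WIFSIGNALED() and WTERMSIG() macros.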


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
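
The WNOHANG behaviour described above can be observed from userland. The
sketch below is an illustrative check, not kernel code: it uses waitpid(2)
rather than wait4(2) for portability, and the function name and sentinel
value are made up for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Poll a still-running child with WNOHANG and report whether the
 * caller's status word kept its sentinel value.
 * Returns 1 if untouched, 0 if modified, -1 on error.
 */
static int
wnohang_leaves_status_alone(void)
{
	const int sentinel = 0x5a5a5a5a;
	int status = sentinel;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: block until the parent kills us. */
		for (;;)
			pause();
	}

	/* The child has not terminated, so this should return 0 at once. */
	ret = waitpid(pid, &status, WNOHANG);

	/* Clean up: terminate and reap the child. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	if (ret != 0)
		return -1;
	return status == sentinel;
}
```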


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
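
For readers unfamiliar with TAILQs, the sketch below shows the <sys/queue.h>
macros on a toy run queue. The structure and function names (fake_proc,
fake_runq, runq_enqueue, runq_dequeue) are hypothetical stand-ins and are
not the kernel's actual definitions.

```c
#include <sys/queue.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued thread; not the kernel's struct proc. */
struct fake_proc {
	int			fp_id;
	TAILQ_ENTRY(fake_proc)	fp_runq;	/* linkage within the queue */
};

TAILQ_HEAD(fake_runq, fake_proc);

/* Append to the tail of the queue in constant time. */
static void
runq_enqueue(struct fake_runq *q, struct fake_proc *p)
{
	TAILQ_INSERT_TAIL(q, p, fp_runq);
}

/* Remove and return the head of the queue, or NULL if it is empty. */
static struct fake_proc *
runq_dequeue(struct fake_runq *q)
{
	struct fake_proc *p = TAILQ_FIRST(q);

	if (p != NULL)
		TAILQ_REMOVE(q, p, fp_runq);
	return p;
}
```

The queue head must be set up with TAILQ_INIT() before the first insert.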


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok. No one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
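
The encoding trick can be checked with a few lines of C. The DEMO_ macros below are hypothetical copies written from the description above (octal 0177 for "stopped", 0177777 for "continued"); they are assumptions for illustration, not a quotation of <sys/wait.h>.

```c
#include <assert.h>

/* Hypothetical DEMO_ copies of the encoding described above. */
#define DEMO_WSTATUS(x)		((x) & 0177)
#define DEMO_WSTOPPED		0177
#define DEMO_WCONTINUED		0177777
#define DEMO_WIFSTOPPED(x)	(((x) & 0xff) == DEMO_WSTOPPED)
#define DEMO_WIFCONTINUED(x)	(((x) & DEMO_WCONTINUED) == DEMO_WCONTINUED)
#define DEMO_WIFSIGNALED(x) \
	(DEMO_WSTATUS(x) != DEMO_WSTOPPED && DEMO_WSTATUS(x) != 0)

/* Build a "stopped by signal sig" status: signal in the high byte. */
int
demo_stopped_status(int sig)
{
	assert(sig > 0 && sig < 128);
	return (sig << 8) | DEMO_WSTOPPED;
}
```

With these definitions the low seven bits of the continued value still look like "stopped", so the signaled test rejects it unchanged, while the stopped test compares eight bits and rejects it as well.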


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (almost nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
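
The clipping bound from the "nice bug" paragraph can be made concrete with a short sketch. The constant values here (nice weight 2, largest nice 20, 4 priority levels per queue) are assumptions picked to make the arithmetic visible, and demo_estcpu_clip() is a hypothetical helper, not the scheduler's code.

```c
#include <assert.h>

/* Hypothetical constants, chosen only to make the arithmetic concrete. */
#define DEMO_NICE_WEIGHT	2u	/* priority levels per nice level */
#define DEMO_PRIO_MAX		20u	/* largest nice value */
#define DEMO_PPQ		4u	/* priority levels per run queue */

/* Upper bound for the CPU estimate: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define DEMO_ESTCPU_LIMIT	(DEMO_NICE_WEIGHT * DEMO_PRIO_MAX - DEMO_PPQ)

static_assert(DEMO_ESTCPU_LIMIT > 0, "limit must be positive");

/* Clip a CPU-usage estimate to the bound named in the message. */
unsigned int
demo_estcpu_clip(unsigned int estcpu)
{
	return estcpu > DEMO_ESTCPU_LIMIT ? DEMO_ESTCPU_LIMIT : estcpu;
}
```

Under these assumed values the bound is 36, so a compute-bound process at nice 0 stays at least one queue below a process at nice +20.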


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.197 08-Mar-2021 mpi

Kill SINGLE_PTRACE and use SINGLE_SUSPEND which has almost the same semantic.

single_thread_set() is modified to explicitly indicate when waiting until
sibling threads are parked is required. This is obviously not required if
a traced thread is switching away from a CPU after handling a STOP signal.

ok claudio@


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
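
How two separate fields fold into one wait status can be sketched as below, assuming the traditional encoding (exit code in the high byte, signal in the low bits) used by the BSDs and Linux. demo_wait_status() is a hypothetical helper, not the kernel's macro.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: combine a separately stored exit code and
 * terminating signal into a single wait(2)-style status word.
 */
int
demo_wait_status(int xexit, int xsig)
{
	assert(xexit >= 0 && xexit <= 255);
	assert(xsig >= 0 && xsig < 127);
	return (xexit << 8) | xsig;
}
```

The result can be decoded with the usual WIFEXITED(), WEXITSTATUS(), WIFSIGNALED() and WTERMSIG() macros on systems that share this encoding.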


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@
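
Copy-on-write sharing of a limits structure can be sketched in single-threaded userland C. Everything here is hypothetical (struct demo_plimit and the demo_lim_*() helpers); the kernel version needs atomic reference counts and locking that this sketch leaves out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NLIMITS	4

struct demo_plimit {
	long	pl_rlimit[DEMO_NLIMITS];
	int	pl_refcnt;		/* number of sharers */
};

/* Allocate a zeroed limits struct owned by one sharer. */
struct demo_plimit *
demo_lim_new(void)
{
	struct demo_plimit *l = calloc(1, sizeof(*l));

	if (l != NULL)
		l->pl_refcnt = 1;
	return l;
}

/* Share the struct: readers just take another reference. */
struct demo_plimit *
demo_lim_share(struct demo_plimit *l)
{
	assert(l != NULL && l->pl_refcnt > 0);
	l->pl_refcnt++;
	return l;
}

/* Return a struct that is safe to modify, copying only if shared. */
struct demo_plimit *
demo_lim_write_begin(struct demo_plimit *l)
{
	struct demo_plimit *copy;

	assert(l != NULL && l->pl_refcnt > 0);
	if (l->pl_refcnt == 1)
		return l;
	copy = malloc(sizeof(*copy));
	if (copy == NULL)
		return NULL;
	memcpy(copy, l, sizeof(*copy));
	copy->pl_refcnt = 1;
	l->pl_refcnt--;		/* the writer leaves the shared copy */
	return copy;
}

/* Drop one reference and free the struct with the last one. */
void
demo_lim_free(struct demo_plimit *l)
{
	if (l != NULL && --l->pl_refcnt == 0)
		free(l);
}
```

Readers never copy; only a writer that finds the struct shared pays for the allocation, which is what keeps the read path cheap.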


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
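
The ordering described in 1.121 can be sketched as follows. This is only an illustration under stated assumptions: `struct vnode_stub`, `struct proc_stub`, `vrele_stub()` and `release_textvp()` are hypothetical stand-ins, not the kernel's definitions, and the stub release merely drops a counter where the real vrele() may sleep.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (assumptions). */
struct vnode_stub {
	int v_usecount;			/* reference count */
};

struct proc_stub {
	struct vnode_stub *p_textvp;	/* pointer reachable by other code */
};

/* Stand-in for vrele(): drops one reference; the real one may sleep. */
static void
vrele_stub(struct vnode_stub *vp)
{
	vp->v_usecount--;
}

/*
 * Detach first, release second: once p_textvp is cleared, nobody can
 * find the vnode through the proc, so a sleep inside the release can
 * no longer expose a half-freed pointer.
 */
static void
release_textvp(struct proc_stub *p)
{
	struct vnode_stub *vp = p->p_textvp;

	p->p_textvp = NULL;
	if (vp != NULL)
		vrele_stub(vp);
}
```

Calling `release_textvp()` a second time is harmless in this sketch, because the pointer has already been cleared.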


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (roughly nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
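
The p_estcpu clipping described under "nice bug" can be illustrated with a small sketch. It is not the kernel code: the values PRIO_MAX = 20 and PPQ = 4, the helper names, and the unscaled sum of the clipped p_estcpu and the weighted nice value are assumptions made for the example. Only the nice weight of two and the limit NICE_WEIGHT * PRIO_MAX - PPQ come from the message above.

```c
/* Constants: NICE_WEIGHT is from the log; the other two are assumed. */
#define NICE_WEIGHT	2	/* "nice weight of two" */
#define PRIO_MAX	20	/* assumed: largest nice value */
#define PPQ		4	/* assumed: priorities per run queue */

/* Upper bound the message gives for p_estcpu. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the CPU-usage estimate so it cannot outweigh a full nice +20. */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}

/* Hypothetical priority offset above the base user priority. */
static unsigned int
pri_offset(unsigned int estcpu, unsigned int nice)
{
	return clip_estcpu(estcpu) + NICE_WEIGHT * nice;
}

/* Run queue index implied by an offset (PPQ priorities per queue). */
static unsigned int
queue_of(unsigned int offset)
{
	return offset / PPQ;
}
```

Under these assumptions a compute-bound nice +0 process tops out at offset 36 (queue 9), while an idle nice +20 process starts at offset 40 (queue 10), so the former can no longer penalize itself onto the latter's queue.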


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.196 15-Feb-2021 mpi

Move single_thread_set() out of KERNEL_LOCK().

Use the SCHED_LOCK() to ensure `ps_thread' isn't being modified by a sibling
when entering tsleep(9) w/o KERNEL_LOCK().

ok visa@


# 1.195 08-Feb-2021 mpi

Revert the conversion of the per-process thread list into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
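
A minimal sketch of the "run once" guard described above, not the kernel
code itself. The struct, the flag value, and the function name are
hypothetical stand-ins, and locking is deliberately left out:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the process-wide PS_EXITING flag. */
#define SK_PS_EXITING	0x1u

/* Hypothetical, heavily reduced stand-in for struct process. */
struct sk_process {
	unsigned int	flags;	/* process-wide flags */
	int		xexit;	/* recorded exit code */
};

/*
 * Record the exit code only on the first call. Returns true if this
 * call did the process-level teardown, false if it had already run.
 */
static bool
sk_exit_normal_once(struct sk_process *pr, int code)
{
	if (pr->flags & SK_PS_EXITING)
		return false;	/* already exiting: keep the first status */
	pr->flags |= SK_PS_EXITING;
	pr->xexit = code;
	return true;
}
```

With this shape, a later call from the last thread cannot overwrite the
status recorded by an earlier exit() call.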


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@
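
A userland analogy of the helper idea, assuming only the standard
setitimer(2) interface. The sk_* names are hypothetical and this is not
the in-kernel cancelitimer(); it only shows how a small wrapper keeps
callers from handling struct itimerval themselves:

```c
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Disarm one interval timer; returns 0 on success, -1 on error. */
static int
sk_cancel_itimer(int which)
{
	struct itimerval itv;

	/* An all-zero it_value disarms the timer. */
	memset(&itv, 0, sizeof(itv));
	return setitimer(which, &itv, NULL);
}

/* Disarm every per-process interval timer; -1 if any call failed. */
static int
sk_cancel_all_itimers(void)
{
	static const int timers[] = {
		ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF
	};
	size_t i;
	int rv = 0;

	for (i = 0; i < sizeof(timers) / sizeof(timers[0]); i++) {
		if (sk_cancel_itimer(timers[i]) == -1)
			rv = -1;
	}
	return rv;
}
```

Going through the same entry point as the syscalls means the cancellation
inherits whatever locking that entry point already does.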


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
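
A sketch of what "convert to a consolidated status" could look like,
assuming the traditional wait(2) encoding (exit code in bits 8-15,
terminating signal in the low 7 bits). The helper name is hypothetical
and is not taken from the kernel source:

```c
#include <sys/wait.h>

/*
 * Fold a separately stored exit code and signal number into a single
 * wait(2)-style status word (traditional encoding assumed).
 */
static int
sk_make_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

The result can be decoded with the usual WIFEXITED(), WEXITSTATUS(),
WIFSIGNALED() and WTERMSIG() macros.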


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
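
A small userland check of the behaviour described above, written against
waitpid(2) for portability rather than wait4(2). The function name and
the sentinel value are hypothetical; this is a sketch of how one might
observe the rule, not a regression test from the tree:

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that blocks, poll it once with WNOHANG, then clean up.
 * Returns 1 if the poll returned 0 and left the status word untouched,
 * 0 if it did not, and -1 if fork() failed.
 */
static int
sk_wnohang_leaves_status(void)
{
	const int sentinel = 0x5a5a;
	int status = sentinel;
	int ok;
	pid_t pid, rv;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		pause();	/* child: block until killed */
		_exit(0);
	}

	/* The child is still running, so nothing is reported yet. */
	rv = waitpid(pid, &status, WNOHANG);
	ok = (rv == 0 && status == sentinel);

	/* Reap the child so no zombie is left behind. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return ok;
}
```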


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.195 08-Feb-2021 mpi

Revert the conversion of per-process thread into a SMR_TAILQ.

We did not reach a consensus about using SMR to unlock single_thread_set()
so there's no point in keeping this change.


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.
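
A userland analogue of the helper idea, offered only as a sketch: the
kernel's cancelitimer() is not reproduced here, and cancel_itimer(),
arm_itimer() and itimer_is_armed() are hypothetical names. It relies on the
documented setitimer(2) rule that a zero it_value disarms the timer.

```c
#include <assert.h>
#include <string.h>
#include <sys/time.h>

/* Arm a one-shot interval timer (hypothetical helper). */
int
arm_itimer(int which, long seconds)
{
	struct itimerval itv;

	memset(&itv, 0, sizeof(itv));
	itv.it_value.tv_sec = seconds;
	return setitimer(which, &itv, NULL);
}

/* Cancel a timer so callers need not build an itimerval themselves. */
int
cancel_itimer(int which)
{
	struct itimerval itv;

	memset(&itv, 0, sizeof(itv));	/* zero it_value disarms the timer */
	return setitimer(which, &itv, NULL);
}

/* Returns 1 if the timer is armed, 0 if not, -1 on error. */
int
itimer_is_armed(int which)
{
	struct itimerval itv;

	if (getitimer(which, &itv) == -1)
		return -1;
	return (itv.it_value.tv_sec != 0 || itv.it_value.tv_usec != 0);
}
```

Going through the same entry point as the syscalls means the cancel path
inherits whatever locking that entry point already does, which is the
design point the commit message makes.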

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))
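
The "consolidated status" step can be sketched as below. This assumes the
traditional wait status layout (exit code in bits 8-15, terminating signal
in the low 7 bits); make_wait_status() is a hypothetical helper, not the
macro or function the kernel actually uses.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: combine a separately stored exit code and
 * signal number into one wait(2)-style status word, assuming the
 * traditional layout described above.
 */
int
make_wait_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

With that layout, a status built from (3, 0) decodes through WIFEXITED()
and WEXITSTATUS(), and one built from (0, SIGKILL) decodes through
WIFSIGNALED() and WTERMSIG().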

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
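
The WNOHANG behaviour described above can be checked from userland. The following is a minimal sketch, not the kernel change itself: the function name and the sentinel value are made up for illustration, and it assumes a kernel that leaves the status word alone when wait4() returns 0.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that blocks in pause(), poll it
 * once with WNOHANG, and report whether the caller's status word was
 * left untouched.  Returns 1 if wait4() returned 0 and status kept
 * its sentinel value, 0 otherwise, and -1 if fork() failed.
 */
int
wnohang_leaves_status_alone(void)
{
	const int sentinel = 0x5a5a;	/* arbitrary marker value */
	int status = sentinel;
	struct rusage ru;
	pid_t pid, r;
	int ok;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: stay alive until killed */
	}

	memset(&ru, 0, sizeof(ru));
	r = wait4(pid, &status, WNOHANG, &ru);	/* child is still running */
	ok = (r == 0 && status == sentinel);

	/* Clean up: terminate and reap the child. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return ok;
}
```

The sketch only inspects status; checking the rusage contents as well would depend on what the kernel copies out in the return-0 case.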


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred.

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all that's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
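
The clipping rule from the "nice bug" paragraph can be sketched as follows.
This is only an illustration, not the committed code: the constant values
(PRIO_MAX = 20, NICE_WEIGHT = 2, PPQ = 4) and the names ESTCPU_MAX and
clip_estcpu() are assumptions chosen to match the description above.

```c
/*
 * Assumed values, picked to match the commit message; they are not
 * copied from the kernel headers.
 */
#define PRIO_MAX	20	/* largest nice value */
#define NICE_WEIGHT	2	/* priority points per nice step */
#define PPQ		4	/* priorities per run queue (assumption) */

/* Hypothetical name for the bound "NICE_WEIGHT * PRIO_MAX - PPQ". */
#define ESTCPU_MAX	(NICE_WEIGHT * PRIO_MAX - PPQ)

/*
 * Clip the CPU-usage estimate so that a compute-bound process cannot
 * penalize itself onto the queue used by nice +20 processes.
 */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return (estcpu > ESTCPU_MAX ? ESTCPU_MAX : estcpu);
}
```

With these assumed values the bound is 36, whereas clipping an unsigned char
to UCHAR_MAX (255) can never change the value.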


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
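
From userland, the effect of SA_NOCLDWAIT can be observed roughly like this.
The sketch relies only on the POSIX sigaction(2) and wait(2) interfaces; the
function name nocldwait_leaves_no_zombie() is made up for the example, and
the reparenting to PID 1 described above happens inside the kernel and is
not visible here.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if, with SA_NOCLDWAIT set, wait() fails with ECHILD after the
 * child has exited (no zombie was left for the parent), 0 if a child was
 * reaped after all, and -1 on setup errors.
 */
static int
nocldwait_leaves_no_zombie(void)
{
	struct sigaction sa, old;
	pid_t pid, w;
	int result;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return (-1);

	pid = fork();
	if (pid == -1) {
		sigaction(SIGCHLD, &old, NULL);
		return (-1);
	}
	if (pid == 0)
		_exit(0);

	/* Blocks until every child has terminated, then fails with ECHILD. */
	w = wait(NULL);
	result = (w == -1 && errno == ECHILD) ? 1 : 0;

	/* Restore the previous SIGCHLD disposition. */
	sigaction(SIGCHLD, &old, NULL);
	return (result);
}
```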


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.194 17-Jan-2021 mvs

Cache parent's pid as `ps_ppid' and use it instead of `ps_pptr->ps_pid'.
This allows us to unlock getppid(2).

ok mpi@


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
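
The guard described above amounts to a "first caller wins" check. A minimal
sketch, assuming a simplified process structure: struct process_sketch,
PS_EXITING_SKETCH and record_exit_once() are hypothetical names, and the
locking or atomic flag updates that a kernel would need are left out.

```c
/* Hypothetical, stripped-down stand-in for the kernel's struct process. */
struct process_sketch {
	unsigned int	ps_flags;
	int		ps_xexit;
};

#define PS_EXITING_SKETCH	0x01	/* stand-in for PS_EXITING */

/*
 * Record the exit status only if no earlier call has done so.
 * Returns 1 if this call recorded the status, 0 if it was already set.
 */
static int
record_exit_once(struct process_sketch *pr, int xexit)
{
	if (pr->ps_flags & PS_EXITING_SKETCH)
		return (0);
	pr->ps_flags |= PS_EXITING_SKETCH;
	pr->ps_xexit = xexit;
	return (1);
}
```

A second call, for example from the last thread being torn down, then
leaves the status from the original exit() call in place.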


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced children in a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct members is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
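
Converting the two separate values back into a single wait status could
look like the sketch below. The bit layout (exit code in bits 8-15,
terminating signal in the low 7 bits) is the traditional Unix encoding and
is an assumption here, as is the helper name consolidate_status(); the
commit itself does not show how the conversion is done.

```c
#include <signal.h>
#include <sys/wait.h>

/*
 * Combine an exit code and a terminating signal into one status word,
 * using the traditional layout (assumed, not taken from the commit).
 * Pass xsig == 0 for a normal exit.
 */
static int
consolidate_status(int xexit, int xsig)
{
	return (((xexit & 0xff) << 8) | (xsig & 0x7f));
}
```

The result can be decoded with the usual WIFEXITED(), WEXITSTATUS(),
WIFSIGNALED() and WTERMSIG() macros from <sys/wait.h>.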


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
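
The WNOHANG behavior can be checked from userland along these lines. This
is a sketch only: the function name wnohang_leaves_status_alone() and the
sentinel value are invented for the example, and it tests just the status
word, not the rusage contents mentioned above.

```c
#define _DEFAULT_SOURCE		/* for wait4() on glibc; harmless elsewhere */
#include <signal.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* arbitrary marker value */

/*
 * Returns 1 if wait4(WNOHANG) on a still-running child returned 0 and
 * left `status' untouched, 0 if it behaved otherwise, -1 on fork failure.
 */
static int
wnohang_leaves_status_alone(void)
{
	struct rusage ru;
	int status = STATUS_SENTINEL;
	pid_t pid, w;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		for (;;)
			pause();	/* stay alive until killed */
	}

	/* The child has not terminated, so this should return 0. */
	w = wait4(pid, &status, WNOHANG, &ru);

	/* Clean up: terminate and reap the child. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	return ((w == 0 && status == STATUS_SENTINEL) ? 1 : 0);
}
```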


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
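
The ordering described above (unpublish the pointer, then release the
reference) can be sketched like this. All names are hypothetical stand-ins:
struct vnode_sketch, struct textholder_sketch, release_vnode() and
drop_text_vnode() are not kernel interfaces, and release_vnode() merely
imitates a vrele() that might sleep.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct vnode_sketch {
	int	refcount;
};

struct textholder_sketch {
	struct vnode_sketch	*textvp;
};

/* Stand-in for vrele(); the real one may sleep when the count hits zero. */
static void
release_vnode(struct vnode_sketch *vp)
{
	vp->refcount--;
}

/*
 * Clear the reachable pointer first, so that nothing can find the vnode
 * through the holder while the release is in progress, then release it.
 */
static void
drop_text_vnode(struct textholder_sketch *h)
{
	struct vnode_sketch *vp = h->textvp;

	h->textvp = NULL;
	if (vp != NULL)
		release_vnode(vp);
}
```

Calling drop_text_vnode() a second time is then harmless, because the
pointer has already been cleared.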


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they have. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
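
The clipping rule in the "nice bug" paragraph above can be sketched as
follows. This is an illustration only, not the kernel code: the SK_*
names are hypothetical, the constant values (nice weight 2, maximum nice
20, 4 priorities per queue) are assumptions, and the priority is shown
as an offset from the base user priority.

```c
/*
 * Assumed, illustrative constants. The names echo the commit message,
 * but the values are assumptions chosen for this sketch.
 */
enum {
	SK_NICE_WEIGHT = 2,	/* priority points per nice step */
	SK_PRIO_MAX = 20,	/* largest nice value */
	SK_PPQ = 4,		/* priorities per run queue (assumed) */
	SK_ESTCPU_LIMIT = SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ
};

/* Clip the decayed CPU estimate to NICE_WEIGHT * PRIO_MAX - PPQ. */
unsigned int
sk_clip_estcpu(unsigned int estcpu)
{
	const unsigned int limit = SK_ESTCPU_LIMIT;

	return estcpu > limit ? limit : estcpu;
}

/* Priority offset: clipped CPU estimate plus the weighted nice value. */
unsigned int
sk_pri_offset(unsigned int estcpu, unsigned int nice)
{
	return sk_clip_estcpu(estcpu) + SK_NICE_WEIGHT * nice;
}

/* Run queue index for a given CPU estimate and nice value. */
unsigned int
sk_run_queue(unsigned int estcpu, unsigned int nice)
{
	return sk_pri_offset(estcpu, nice) / SK_PPQ;
}
```

With these assumed values a compute-bound nice 0 process tops out at
offset 36 (queue 9), one queue short of an idle nice +20 process at
offset 40 (queue 10), which is the separation the clipping is meant to
preserve.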


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.193 09-Dec-2020 mpi

Add kernel-only per-thread kqueue & helpers to initialize and free it.

This will soon be used by select(2) and poll(2).

ok anton@, visa@


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.
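
The run-once guard described above can be sketched in isolation. This is
not the kernel code: the sk_* names are hypothetical, and the sketch
uses a C11 atomic flag as an assumption to keep it self-contained,
whereas the commit only says that the work is guarded by PS_EXITING.

```c
#include <stdatomic.h>

#define SK_PS_EXITING	0x1u	/* stand-in for the PS_EXITING flag */

/* Hypothetical, reduced stand-in for the per-process state. */
struct sk_process {
	atomic_uint	flags;
	int		xexit;	/* recorded exit status */
};

/*
 * Mark the process as exiting. Only the first caller records the exit
 * status and is told to run the normal-exit work; later callers see the
 * flag already set and leave the recorded status alone.
 * Returns 1 for the first caller, 0 for every later one.
 */
int
sk_begin_exit(struct sk_process *pr, int code)
{
	unsigned int old = atomic_fetch_or(&pr->flags, SK_PS_EXITING);

	if (old & SK_PS_EXITING)
		return 0;	/* someone already started the exit */
	pr->xexit = code;
	return 1;
}
```

If one thread calls sk_begin_exit(pr, 3) and the last thread to be torn
down later calls sk_begin_exit(pr, 7), the recorded status stays 3.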

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_set() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
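
The WNOHANG behaviour described above can be checked from userland. The
following is a minimal sketch, assuming a POSIX system; the helper name
probe_wnohang and the sentinel value are illustrative and not part of the
commit. It uses waitpid(2), which shares the status semantics of wait4(2),
so the rusage half of the fix is not exercised here.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* arbitrary marker, not a real status */

/*
 * Fork a child that blocks on a pipe, poll it with WNOHANG, then let
 * it exit and reap it.  Returns 0 if the WNOHANG call returned 0 and
 * left the status word untouched, -1 on any other outcome.
 */
static int
probe_wnohang(void)
{
	int fds[2], status = STATUS_SENTINEL, result = -1;
	pid_t pid, r;
	char c;

	if (pipe(fds) == -1)
		return -1;
	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: block until the parent closes the write end. */
		close(fds[1]);
		if (read(fds[0], &c, 1) == -1)
			_exit(1);
		_exit(0);
	}
	close(fds[0]);

	/* The child is still blocked, so this returns 0 at once. */
	r = waitpid(pid, &status, WNOHANG);
	if (r == 0 && status == STATUS_SENTINEL)
		result = 0;

	/* Let the child exit, then reap it for real. */
	close(fds[1]);
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		result = -1;
	return result;
}
```

A caller that pre-loads the status word with a sentinel, as above, can tell
"nothing to report" apart from a real status without inspecting the value.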


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
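
The "cache of recently exited pids" idea can be sketched as a small ring
buffer. This is an illustration only: the names pid_release and
pid_recently_used and the cache size are assumptions, not the functions or
sizes introduced by the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define RECENT_PIDS	8	/* hypothetical cache size */

/* Zero-filled at start; pid 0 is never handed out, so that is harmless. */
static pid_t recent_pids[RECENT_PIDS];
static size_t recent_next;

/* Remember a pid that has just been released by an exiting process. */
static void
pid_release(pid_t pid)
{
	recent_pids[recent_next] = pid;
	recent_next = (recent_next + 1) % RECENT_PIDS;
}

/* Return 1 if pid exited recently and should not be recycled yet. */
static int
pid_recently_used(pid_t pid)
{
	size_t i;

	for (i = 0; i < RECENT_PIDS; i++)
		if (recent_pids[i] == pid)
			return 1;
	return 0;
}
```

An allocator built on this would skip any candidate for which
pid_recently_used() returns 1; the oldest entry falls out of the cache once
RECENT_PIDS newer pids have been released.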


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
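
The "erase the pointer first, then release" ordering can be sketched in
userland as follows. The struct and function names are stand-ins invented
for the illustration; only the ordering reflects the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the kernel types; only the fields the sketch needs. */
struct vnode_sketch {
	int refcnt;
};

struct owner_sketch {
	struct vnode_sketch *textvp;
};

/* Drop a reference; in the kernel the equivalent step may sleep. */
static void
vrele_sketch(struct vnode_sketch *vp)
{
	assert(vp->refcnt > 0);
	if (--vp->refcnt == 0)
		free(vp);
}

/*
 * Detach first, release second: once textvp is NULL nothing can reach
 * the vnode through the owner while the release is in progress.
 */
static void
owner_drop_textvp(struct owner_sketch *o)
{
	struct vnode_sketch *vp = o->textvp;

	o->textvp = NULL;
	if (vp != NULL)
		vrele_sketch(vp);
}
```

Calling owner_drop_textvp() twice is safe in this sketch, because the second
call finds the pointer already cleared.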


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.
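
The GC-list scheme described above can be sketched as follows. This is a
single-threaded illustration with invented names (thr_handle, gc_enqueue,
gc_collect); the real library needs locking around the list, and the field
layout here is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical userland thread handle; field names are illustrative. */
struct thr_handle {
	volatile pid_t tid;		/* zeroed by the kernel on final exit */
	void *stack;			/* malloc'd stack in this sketch */
	struct thr_handle *gc_next;
};

static struct thr_handle *gc_list;

/* Called by a dying thread: queue itself for later collection. */
static void
gc_enqueue(struct thr_handle *t)
{
	t->gc_next = gc_list;
	gc_list = t;
}

/*
 * Free every queued handle whose tid has been zeroed.  A handle whose
 * tid is still non-zero may still be running on its stack, so it stays
 * queued.  Returns the number of handles freed.
 */
static int
gc_collect(void)
{
	struct thr_handle **pp = &gc_list, *t;
	int freed = 0;

	while ((t = *pp) != NULL) {
		if (t->tid == 0) {
			*pp = t->gc_next;
			free(t->stack);
			free(t);
			freed++;
		} else
			pp = &t->gc_next;
	}
	return freed;
}
```

In this model gc_collect() would be called from the thread-creation and join
paths, which matches the points at which the commit message says the list
is scanned.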

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
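
The reference-count idea can be sketched as below. The names
(process_sketch, process_ref, process_rele) and the processes_freed counter
are invented for the illustration and do not reflect the kernel's actual
field or function names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct process_sketch {
	int refcnt;		/* one reference per thread */
};

static int processes_freed;	/* instrumentation for the sketch only */

/* Create the structure with the main thread's reference. */
static struct process_sketch *
process_new(void)
{
	struct process_sketch *pr = malloc(sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = 1;
	return pr;
}

/* A newly created thread takes its own reference. */
static void
process_ref(struct process_sketch *pr)
{
	pr->refcnt++;
}

/* Whichever thread drops the last reference frees the structure. */
static void
process_rele(struct process_sketch *pr)
{
	assert(pr->refcnt > 0);
	if (--pr->refcnt == 0) {
		free(pr);
		processes_freed++;
	}
}
```

With this arrangement it does not matter whether the main thread or another
thread is reaped first: the structure stays valid until the last holder
releases it.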


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
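
A userland analogue of atomic set/clear operations on a flag word, written
with C11 <stdatomic.h> rather than the kernel's atomic_{set,clear}bits_int.
The flag names and helper names are hypothetical; the sketch only shows why
a single atomic read-modify-write is used instead of a plain "flags |= bit".

```c
#include <assert.h>
#include <stdatomic.h>

#define F_TRACED	0x01u	/* hypothetical flag bits */
#define F_STOPPED	0x02u

/* Atomically OR bits into the flag word. */
static void
flag_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits from the flag word. */
static void
flag_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

Because each helper is one indivisible update, an interrupt or another CPU
changing a different bit at the same moment cannot cause either update to be
lost.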


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
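
The encoding trick can be illustrated with stand-in macros. The X_ names and
the numeric values below are assumptions modelled on the traditional BSD
status layout, not copied from the header this commit touched; the point is
only to show how a "continued" value can share its low 7 bits with "stopped"
and still be told apart.

```c
#include <assert.h>
#include <string.h>

/* Assumed values mirroring the traditional BSD status encoding. */
#define X_WSTOPPED	0177		/* low 7 bits all set: stopped */
#define X_WCONTINUED	0177777		/* low 16 bits all set: continued */

#define X_WSTATUS(x)		((x) & 0177)
#define X_WIFSTOPPED(x)		(((x) & 0xff) == X_WSTOPPED)
#define X_WIFCONTINUED(x)	(((x) & X_WCONTINUED) == X_WCONTINUED)
#define X_WIFSIGNALED(x) \
	(X_WSTATUS(x) != X_WSTOPPED && X_WSTATUS(x) != 0)
#define X_STOPCODE(sig)		(((sig) << 8) | X_WSTOPPED)

/* Classify a status word using only the macros above. */
static const char *
status_kind(int status)
{
	if (X_WIFCONTINUED(status))
		return "continued";
	if (X_WIFSTOPPED(status))
		return "stopped";
	if (X_WIFSIGNALED(status))
		return "signaled";
	return "exited";
}
```

Since X_WSTATUS(X_WCONTINUED) equals X_WSTOPPED, the signaled test rejects a
continued status without an extra check, while the stopped test looks at the
eighth bit as well and therefore also rejects it.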


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.192 07-Dec-2020 mpi

Convert the per-process thread list into a SMR_TAILQ.

Currently all iterations are done under KERNEL_LOCK() and therefore use
the *_LOCKED() variant.

From and ok claudio@


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@
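
The guard described above can be pictured with a small user-space sketch. It is only an illustration under stated assumptions: the struct, the flag value and the function name are hypothetical, and the sketch is single-threaded, whereas the kernel has to set the flag under its own locking rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for PS_EXITING. */
#define PS_EXITING_SKETCH	0x01

/* Hypothetical stand-in for the per-process state touched on exit. */
struct process_sketch {
	unsigned int flags;
	int xexit;	/* exit code */
	int xsig;	/* terminating signal, 0 if none */
};

/*
 * Record the exit status only the first time through. A later caller
 * (for example the last thread being torn down) sees the flag and
 * leaves the already-recorded status untouched.
 */
static bool
record_exit_once_sketch(struct process_sketch *pr, int xexit, int xsig)
{
	assert(pr != NULL);
	if (pr->flags & PS_EXITING_SKETCH)
		return false;	/* status already recorded; do not clobber it */
	pr->flags |= PS_EXITING_SKETCH;
	pr->xexit = xexit;
	pr->xsig = xsig;
	return true;
}
```

With this shape, a second call made during thread teardown returns false and the status stored by the first exit() call survives.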


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

branches: 1.188.4; 1.188.6;
Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
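
The consolidation step can be illustrated with a minimal user-space sketch. It assumes the traditional BSD wait-status encoding (exit code in bits 8-15, terminating signal in the low 7 bits) and ignores the core-dump flag; the struct and function names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <sys/wait.h>

/* Hypothetical stand-in for the two fields named in the commit. */
struct exit_info_sketch {
	int xexit;	/* exit code passed to _exit(2), 0..255 */
	int xsig;	/* terminating signal, or 0 for a normal exit */
};

/*
 * Fold the separate exit code and signal into one wait(2)-style status
 * word. Assumes the traditional encoding: (code << 8) | signal.
 */
static int
consolidated_status_sketch(const struct exit_info_sketch *ei)
{
	assert(ei->xexit >= 0 && ei->xexit <= 0xff);
	assert(ei->xsig >= 0 && ei->xsig < 0x7f);
	return (ei->xexit << 8) | ei->xsig;
}

/* Decode with the standard macros to show the round trip. */
static int
exited_with_sketch(int status, int code)
{
	return WIFEXITED(status) && WEXITSTATUS(status) == code;
}
```

Keeping the two values apart until a consumer asks for a status word means each consumer can build the representation it needs from the same source fields.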


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
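
The ordering described above (unhook the pointer, then release) can be sketched in plain C. This is only an illustration: the structs and functions below are hypothetical stand-ins, and the release routine here merely decrements a counter where the real vrele() may sleep.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object standing in for a vnode. */
struct vnode_sketch {
	int refcnt;
};

/* Hypothetical owner that holds a reachable pointer to the object. */
struct holder_sketch {
	struct vnode_sketch *vp;
};

/* Stand-in for a release routine that may sleep when the count hits zero. */
static void
release_sketch(struct vnode_sketch *vp)
{
	assert(vp->refcnt > 0);
	vp->refcnt--;
}

/*
 * Clear the reachable pointer before releasing, so that nothing can
 * find the object through the holder while the release is in progress.
 */
static void
drop_vnode_sketch(struct holder_sketch *h)
{
	struct vnode_sketch *vp = h->vp;

	if (vp == NULL)
		return;
	h->vp = NULL;		/* unhook first */
	release_sketch(vp);	/* only then drop the reference */
}
```

Calling drop_vnode_sketch() a second time is harmless, because the holder no longer points at the object.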


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it has been unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they have. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
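
The clipping rule in the paragraph above can be sketched in a few lines of C.
This is an illustration under stated assumptions, not the committed code: the
nice weight of two and the bound NICE_WEIGHT * PRIO_MAX - PPQ come from the
message, while the values PRIO_MAX = 20 and PPQ = 4, the SKETCH_ prefixes, and
all function names are assumptions made for the example. It also assumes the
unscaled form of p_estcpu described in the schedclock() paragraph.

```c
#include <assert.h>

#define NICE_WEIGHT	2	/* "nice weight of two", from the message */
#define SKETCH_PRIO_MAX	20	/* assumed: largest nice value */
#define SKETCH_PPQ	4	/* assumed: priorities per run queue */

/* Upper bound on p_estcpu named in the message: 2 * 20 - 4 = 36. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * SKETCH_PRIO_MAX - SKETCH_PPQ)

/* Clip the decayed CPU estimate so it can never outweigh nice +20. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT;
}

/*
 * User priority offset; a larger value is less favoured.
 * nice is expressed here as 0..SKETCH_PRIO_MAX.
 */
unsigned int
priority_offset(unsigned int estcpu, unsigned int nice)
{
	assert(nice <= SKETCH_PRIO_MAX);
	return clip_estcpu(estcpu) + NICE_WEIGHT * nice;
}

/* Index of the run queue that a given offset falls into. */
unsigned int
queue_index(unsigned int offset)
{
	return offset / SKETCH_PPQ;
}
```

Under these assumptions a fully compute-bound nice +0 process tops out at
offset 36, one queue below the offset of 40 that an idle nice +20 process
starts from, which is the separation the message asks for.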

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.191 16-Nov-2020 jsing

Prevent exit status from being clobbered on thread exit.

Ensure that EXIT_NORMAL only runs once by guarding it with PS_EXITING.

It was previously possible for EXIT_NORMAL to be run twice, depending on
which thread called exit() and the order in which the threads were torn
down. This is due to the P_HASSIBLING() check triggering the last thread
to run EXIT_NORMAL, even though it may have already been run via an exit()
call.

ok kettenis@ visa@


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
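
The WNOHANG behaviour described above can be observed from userland. The
following is a minimal sketch, not code from the tree: it uses waitpid(2)
rather than wait4(2) for portability, and the function name and sentinel
value are made up for illustration. It polls a child that is still running
and checks that the status word was left alone.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a WNOHANG poll of a still-running
 * child returned 0 and left the caller's status word untouched,
 * 0 if it did not, and -1 if fork() failed.
 */
int
wnohang_leaves_status_alone(void)
{
	const int sentinel = 0x5a5a;	/* arbitrary marker value */
	int status = sentinel;
	int reaped;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: block until the parent kills us. */
		for (;;)
			pause();
	}

	/* The child is alive, so this poll should report "nothing yet". */
	ret = waitpid(pid, &status, WNOHANG);

	/* Clean up: terminate the child and reap it. */
	kill(pid, SIGKILL);
	while (waitpid(pid, &reaped, 0) == -1 && errno == EINTR)
		continue;

	return (ret == 0 && status == sentinel);
}
```

A return value of 1 matches the behaviour this revision describes; the
sketch does not exercise the rusage half of the change.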


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
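
The idea of a small cache of recently exited pids can be sketched as a
fixed-size ring buffer. This is an illustration only, not the code from
this revision: the names pid_cache, pid_cache_remember() and
pid_cache_contains(), and the size RECENT_PIDS, are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define RECENT_PIDS 8		/* hypothetical cache size */

struct pid_cache {
	int	slots[RECENT_PIDS];	/* 0 marks an empty slot */
	size_t	next;			/* slot to overwrite next */
};

/* Record an exited pid, evicting the oldest entry once the ring is full. */
void
pid_cache_remember(struct pid_cache *pc, int pid)
{
	pc->slots[pc->next] = pid;
	pc->next = (pc->next + 1) % RECENT_PIDS;
}

/* True if pid exited recently and should not be handed out again yet. */
bool
pid_cache_contains(const struct pid_cache *pc, int pid)
{
	size_t i;

	if (pid <= 0)		/* 0 is the empty marker, never a match */
		return false;
	for (i = 0; i < RECENT_PIDS; i++) {
		if (pc->slots[i] == pid)
			return true;
	}
	return false;
}
```

An allocator built on this would skip any candidate pid for which
pid_cache_contains() returns true, so a pid only becomes reusable after
RECENT_PIDS later exits have pushed it out of the ring.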


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
in PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
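
The clipping rule described under "nice bug" can be sketched in a few lines
of C. This is only an illustration of the arithmetic, not the kernel code:
the constant values (PUSER, MAXPRI, PPQ) and the function names are
assumptions chosen for the example, and the priority formula simply adds the
unscaled p_estcpu estimate to the nice weight as the message describes.

```c
#include <assert.h>

/* Illustrative values only; the real constants live in the kernel headers. */
#define PUSER        50   /* assumed base priority for user processes */
#define MAXPRI       127  /* assumed numerically largest (worst) priority */
#define PPQ          4    /* assumed priorities per run queue (128 / 32) */
#define PRIO_MAX     20   /* largest nice value */
#define NICE_WEIGHT  2    /* priority steps per nice level, per the log */

/* Clip the CPU-usage estimate so it can never outweigh nice +20. */
unsigned int
estcpu_clip(unsigned int estcpu)
{
	const unsigned int limit = NICE_WEIGHT * PRIO_MAX - PPQ;

	return estcpu < limit ? estcpu : limit;
}

/* User priority: larger numbers mean the process runs later. */
int
user_priority(unsigned int estcpu, int nice)
{
	int pri;

	assert(nice >= 0 && nice <= PRIO_MAX);
	pri = PUSER + (int)estcpu_clip(estcpu) + NICE_WEIGHT * nice;
	return pri > MAXPRI ? MAXPRI : pri;
}

/* Run queue index for a given priority (hypothetical helper). */
int
run_queue(int pri)
{
	return pri / PPQ;
}
```

With these assumed values a compute-bound process at nice +0 ends up one
run queue ahead of an idle process at nice +20, which is the separation the
clipping is meant to guarantee.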


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.190 15-Oct-2020 cheloha

_exit(2), execve(2): tweak per-process interval timer cancellation

If we fold the for-loop iterating over each interval timer into the
helper function the result is slightly tidier than what we have now.
Rename the helper function "cancel_all_itimers".

Based on input from millert@ and kettenis@.


# 1.189 15-Oct-2020 cheloha

_exit(2), execve(2): cancel per-process interval timers safely

During _exit(2) and sometimes during execve(2) we need to cancel any
active per-process interval timers. We don't currently do this in an
MP-safe way. Both syscalls ignore the locking assumptions documented
in proc.h.

The easiest way to make them MP-safe is to use setitimer(), just like
the getitimer(2) and setitimer(2) syscalls do. To make things a bit
cleaner I have added a helper function, cancelitimer(), so the callers
don't need to fuss with an itimerval struct.

While we're here we can remove the splclock/splx dance from execve(2).
It is no longer necessary.

ok deraadt@


Revision tags: OPENBSD_6_7_BASE OPENBSD_6_8_BASE
# 1.188 18-Mar-2020 visa

Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
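
The "consolidated status" conversion can be sketched like this. The helper
names are hypothetical, and the bit layout is the traditional wait(2)
encoding (exit code in bits 8-15, signal number in the low 7 bits), which is
assumed here rather than taken from the commit.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: fold a separately stored exit code and signal
 * number into one wait(2)-style status word, assuming the traditional
 * encoding.
 */
int
make_wait_status(int xexit, int xsig)
{
	assert(xexit >= 0 && xexit <= 0xff);
	assert(xsig >= 0 && xsig <= 0x7f);
	return (xexit << 8) | xsig;
}

/* Decode with the standard macros; returns -1 if not a normal exit. */
int
status_exit_code(int status)
{
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Returns the terminating signal, or -1 if the process was not signalled. */
int
status_term_signal(int status)
{
	return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Keeping the two values apart until a caller needs the combined word means
only the consumers named above have to know the encoding.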


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
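
The WNOHANG behaviour described above can be checked from userland with a
sketch like the following. It uses waitpid() rather than wait4() to stay
within POSIX, and the function name and sentinel value are assumptions made
for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL 0x5a5a5a5a	/* arbitrary marker value */

/*
 * Fork a child that blocks until the parent closes a pipe, poll it with
 * WNOHANG, and report whether the status word was left alone.
 * Returns 1 if untouched, 0 if modified, -1 on setup failure.
 */
int
wnohang_leaves_status_untouched(void)
{
	int fds[2];
	int status = STATUS_SENTINEL;
	int untouched;
	char c;
	pid_t pid, ret;

	if (pipe(fds) == -1)
		return -1;
	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: wait for EOF on the pipe, then exit. */
		close(fds[1]);
		if (read(fds[0], &c, 1) < 0)
			_exit(1);
		_exit(0);
	}
	close(fds[0]);

	/* The child is still running, so this should return 0 at once. */
	ret = waitpid(pid, &status, WNOHANG);
	untouched = (ret == 0 && status == STATUS_SENTINEL);

	/* Release the child and reap it for real. */
	close(fds[1]);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return untouched;
}
```

The pipe keeps the child alive until the poll has happened, so the result
does not depend on scheduling.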


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they have. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.188 18-Mar-2020 visa

Restart child process scan in dowait4() if single_thread_wait() sleeps.
This ensures that the conditions checked are still in force. The sleep
breaks atomicity, allowing another thread to alter the state.

single_thread_wait() should return immediately after sleep when called
from dowait4() because there is no guarantee that the process pr still
exists. When called from single_thread_set(), the process is that of
the calling thread, which prevents process pr from disappearing.

OK anton@, mpi@, claudio@


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
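
For illustration only (this is not the kernel's code): a minimal sketch of
folding a separately stored exit code and signal into one wait(2)-style
status word. It assumes the traditional layout, with the exit code in bits
8-15 and the terminating signal in the low 7 bits. The helper name
make_wait_status() is hypothetical.

```c
#include <assert.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: combine an exit code and a signal number into a
 * single status word that the W* macros from <sys/wait.h> can decode.
 * Assumes the traditional encoding (exit code in bits 8-15, signal in
 * bits 0-6); 0x7f is excluded because it marks a stopped process.
 */
static int
make_wait_status(int xexit, int xsig)
{
	assert(xsig >= 0 && xsig < 0x7f);
	return ((xexit & 0xff) << 8) | xsig;
}
```

With this layout, a normal exit is encoded with xsig == 0, and a death by
signal with xexit == 0 and the signal number in xsig.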


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to keep the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
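
The WNOHANG behaviour described in 1.143 can be observed from userland. The following is a minimal sketch, not code from the tree: it assumes a POSIX system, uses waitpid(2) rather than wait4(2) for portability, and the helper name and sentinel value are made up for illustration. It polls a child that is still running and checks that the status word keeps its previous contents.

```c
#define _POSIX_C_SOURCE 200809L	/* expose kill(2) etc. under strict -std= modes */

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* arbitrary marker, not a real wait status */

/*
 * Hypothetical demo helper: fork a child that blocks in pause(2), poll
 * it once with WNOHANG, then kill and reap it.
 * Returns 1 if the poll returned 0 and left the status word untouched,
 * 0 if either condition failed, -1 if fork(2) failed.
 */
int
wnohang_demo(void)
{
	int status = STATUS_SENTINEL;
	pid_t pid, polled;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		pause();	/* block until the parent kills us */
		_exit(0);
	}

	/* The child has not exited, so this must not block. */
	polled = waitpid(pid, &status, WNOHANG);

	/* Clean up: terminate the child and collect it for real. */
	kill(pid, SIGKILL);
	assert(waitpid(pid, NULL, 0) == pid);

	return (polled == 0 && status == STATUS_SENTINEL);
}
```

A caller would treat a return of 1 as "nothing to report yet, status not modified"; the rusage half of the fix is not exercised here.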


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
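
The reference-count idea in 1.74 can be sketched in a few lines of userland C. This is an illustration only: the struct and function names are hypothetical, no locking is shown, and the real struct process carries far more state. Each attached thread holds one reference, and the structure is released only when the last one lets go, regardless of which thread is reaped first.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct process_sketch {
	int refcnt;		/* one reference per thread still attached */
	int *freed_flag;	/* demo only: set to 1 when released */
};

struct process_sketch *
process_sketch_new(int nthreads, int *freed_flag)
{
	struct process_sketch *pr = malloc(sizeof(*pr));

	if (pr == NULL)
		return NULL;
	pr->refcnt = nthreads;
	pr->freed_flag = freed_flag;
	*freed_flag = 0;
	return pr;
}

/* Drop one reference; free the structure when the last one goes away. */
void
process_sketch_rele(struct process_sketch *pr)
{
	assert(pr->refcnt > 0);
	if (--pr->refcnt == 0) {
		*pr->freed_flag = 1;
		free(pr);
	}
}

/*
 * Demo: the "main thread" is reaped first, but the structure must
 * survive until the second thread is done with it.
 * Returns 0 on the expected ordering, 1 otherwise, -1 on allocation failure.
 */
int
refcnt_demo(void)
{
	int freed, freed_after_first;
	struct process_sketch *pr = process_sketch_new(2, &freed);

	if (pr == NULL)
		return -1;
	process_sketch_rele(pr);	/* main thread reaches the reaper */
	freed_after_first = freed;
	process_sketch_rele(pr);	/* last remaining thread */
	return (freed_after_first == 0 && freed == 1) ? 0 : 1;
}
```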


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
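
As an illustration of the TAILQ conversion mentioned in 1.72, here is a small userland sketch using the queue(3) macros from <sys/queue.h>. It is not the kernel's sleep queue: the struct, the function names, and the single-queue layout are assumptions made for the example. Because TAILQ_REMOVE() unlinks an element in constant time, waking every sleeper on a channel takes one linear pass over the queue.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeper record; the kernel's struct proc is much larger. */
struct sleeper {
	const void *wchan;		/* channel this sleeper waits on */
	int awake;			/* set once the sleeper has been woken */
	TAILQ_ENTRY(sleeper) link;	/* queue linkage */
};
TAILQ_HEAD(sleepq, sleeper);

void
sleepq_add(struct sleepq *q, struct sleeper *s, const void *wchan)
{
	s->wchan = wchan;
	s->awake = 0;
	TAILQ_INSERT_TAIL(q, s, link);
}

/* Wake every sleeper on wchan in a single pass; returns how many woke. */
int
sleepq_wakeup(struct sleepq *q, const void *wchan)
{
	struct sleeper *s, *next;
	int n = 0;

	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		next = TAILQ_NEXT(s, link);	/* fetch before unlinking s */
		if (s->wchan != wchan)
			continue;
		TAILQ_REMOVE(q, s, link);	/* constant-time unlink */
		s->awake = 1;
		n++;
	}
	return n;
}

/* Demo: two sleepers on one channel, one on another. */
int
sleepq_demo(void)
{
	static int chan_a, chan_b;	/* only their addresses matter */
	struct sleepq q;
	struct sleeper s1, s2, s3;
	int woken;

	TAILQ_INIT(&q);
	sleepq_add(&q, &s1, &chan_a);
	sleepq_add(&q, &s2, &chan_b);
	sleepq_add(&q, &s3, &chan_a);

	woken = sleepq_wakeup(&q, &chan_a);

	assert(s1.awake && !s2.awake && s3.awake);
	assert(TAILQ_FIRST(&q) == &s2 && TAILQ_NEXT(&s2, link) == NULL);
	return woken;
}
```

The next pointer is saved by hand before unlinking because not every <sys/queue.h> ships a TAILQ_FOREACH_SAFE() macro.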


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
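
The point of 1.63 is that a plain `flags |= bit` is a read-modify-write that an interrupt can tear. A rough userland analogue, assuming C11 <stdatomic.h> in place of the kernel's atomic_setbits_int() and atomic_clearbits_int(), looks like the sketch below; the helper names and flag values are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define FLAG_A	0x1u
#define FLAG_B	0x4u

/* Set bits with a single atomic read-modify-write. */
static inline void
flag_setbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Clear bits with a single atomic read-modify-write. */
static inline void
flag_clearbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}

/* Demo: set two bits, clear one, return what is left. */
unsigned int
flag_demo(void)
{
	atomic_uint flags = 0;

	flag_setbits(&flags, FLAG_A | FLAG_B);
	flag_clearbits(&flags, FLAG_A);
	assert((atomic_load(&flags) & FLAG_A) == 0);
	return atomic_load(&flags);
}
```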


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
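
From the caller's side, the WCONTINUED support added in 1.47 can be exercised roughly as follows. This is a sketch assuming a POSIX.1-2001 system; the function name and the SAW_* bits are made up for the example, and waitpid(2) is used instead of wait4(2). It stops a child, observes the stop, continues it, and observes the continue.

```c
#define _POSIX_C_SOURCE 200809L	/* expose kill(2), WCONTINUED under strict -std= modes */

#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SAW_STOPPED	0x1
#define SAW_CONTINUED	0x2

/*
 * Hypothetical demo helper.  Returns a mask of the SAW_* events that
 * were reported by waitpid(2), or -1 if fork(2) failed.
 */
int
wcontinued_demo(void)
{
	int status, seen = 0;
	pid_t pid;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* idle until killed by the parent */
	}

	/* Stop the child and wait for the stop to be reported. */
	kill(pid, SIGSTOP);
	if (waitpid(pid, &status, WUNTRACED) == pid && WIFSTOPPED(status))
		seen |= SAW_STOPPED;

	/* Resume it and wait for the continue to be reported. */
	kill(pid, SIGCONT);
	if (waitpid(pid, &status, WCONTINUED) == pid && WIFCONTINUED(status)) {
		/* A "continued" status must not look stopped or signaled. */
		assert(!WIFSTOPPED(status) && !WIFSIGNALED(status));
		seen |= SAW_CONTINUED;
	}

	/* Clean up: terminate the child and reap it. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);
	return seen;
}
```

A return value of SAW_STOPPED | SAW_CONTINUED means both state changes were reported.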


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE, free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
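
As a rough illustration of the clipping problem described under "nice bug"
above, the following C sketch contrasts a clip to UCHAR_MAX (a no-op for an
unsigned char) with a clip to NICE_WEIGHT * PRIO_MAX - PPQ. The function
names are hypothetical, and the values of PRIO_MAX and PPQ are assumptions
chosen for illustration (the log itself only gives the nice weight of two);
this is not the kernel code.

```c
#include <limits.h>

/* Illustrative constants; PRIO_MAX and PPQ values are assumptions. */
enum {
	NICE_WEIGHT = 2,	/* "nice weight of two", per the log */
	PRIO_MAX = 20,		/* assumed maximum nice value */
	PPQ = 4,		/* assumed priorities per run queue */
	ESTCPU_LIMIT = NICE_WEIGHT * PRIO_MAX - PPQ
};

static unsigned int
umin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Old behaviour: an unsigned char can never exceed UCHAR_MAX. */
static unsigned char
clip_old(unsigned char estcpu)
{
	return (unsigned char)umin(estcpu, UCHAR_MAX);
}

/* Clip that keeps the value at or below NICE_WEIGHT * PRIO_MAX - PPQ. */
static unsigned char
clip_new(unsigned char estcpu)
{
	return (unsigned char)umin(estcpu, ESTCPU_LIMIT);
}
```

With these assumed constants the limit works out to 2 * 20 - 4 = 36, so
clip_old() returns every input unchanged while clip_new() caps large values.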


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal
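
A check of this kind is usually a single mask test. The sketch below is a
userland illustration only: the function name is hypothetical, and the set
of accepted flags (WNOHANG and WUNTRACED) is an assumption, not the set the
kernel accepted at this revision.

```c
#include <errno.h>
#include <sys/wait.h>

/*
 * Return 0 if every bit in `options' is in the accepted set,
 * EINVAL otherwise.  The accepted set here is an assumption.
 */
static int
check_wait_options(int options)
{
	const int allowed = WNOHANG | WUNTRACED;

	if (options & ~allowed)
		return EINVAL;
	return 0;
}
```

Rejecting unknown bits up front means that a flag added later cannot be
silently ignored by an older kernel.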


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.187 16-Mar-2020 mpi

Keep track of traced child under a list of orphans while they are being
reparented to a debugger process.

Also re-parent exiting traced processes to their original parent, if it
is still alive, after the debugger has seen the exit status.

Logic comes from FreeBSD pointed out by guenther@.

While here rename proc_reparent() into process_reparent() and get rid of
superfluous checks.

ok visa@


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
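
The "consolidated status" can be pictured as packing the two fields back
into one integer. The sketch below assumes the traditional BSD wait status
layout (terminating signal in the low 7 bits, exit code in bits 8 to 15);
the helper names are hypothetical and the parameters merely mirror the
ps_xexit and ps_xsig fields named above.

```c
/* Pack an exit code and a signal number into one wait-style status. */
static int
make_wait_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}

/* Extract the exit code (bits 8..15) from a packed status. */
static int
status_exit_code(int status)
{
	return (status >> 8) & 0xff;
}

/* Extract the terminating signal (low 7 bits) from a packed status. */
static int
status_term_signal(int status)
{
	return status & 0x7f;
}
```

Keeping the two values separate inside the kernel and packing them only at
the boundary avoids having every consumer decode the combined form.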


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
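
The ordering described here (hide the pointer first, release second) can be
sketched in plain C. Everything below is a hypothetical stand-in: the struct
names, the release hook, and the bookkeeping variables are illustrative and
are not the kernel's vnode API.

```c
#include <stddef.h>

struct vnode_like { int refcnt; };
struct proc_like { struct vnode_like *textvp; };

/* Bookkeeping for the demo release hook below. */
static struct proc_like *observed;
static int pointer_was_cleared;

/*
 * Stand-in for a release routine that might sleep: it records whether
 * the owner's pointer had already been erased when it was called.
 */
static void
demo_release(struct vnode_like *vp)
{
	pointer_was_cleared = (observed->textvp == NULL);
	vp->refcnt--;
}

/* Erase the reachable pointer first, then drop the reference. */
static void
drop_text_vnode(struct proc_like *p, void (*release)(struct vnode_like *))
{
	struct vnode_like *vp = p->textvp;

	p->textvp = NULL;
	if (vp != NULL)
		release(vp);
}
```

Because the pointer is cleared before the release call, anything that runs
while the release sleeps cannot reach the object through the owner.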


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
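
The reference-counting idea in this commit can be sketched in userland as follows. This is only an illustration under stated assumptions: `struct demo_process` and the `demo_*` helpers are hypothetical names, not the kernel's, and no locking is shown.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct demo_process {
	int refcnt;	/* one reference per thread still using it */
};

static struct demo_process *
demo_process_new(void)
{
	struct demo_process *pr = calloc(1, sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = 1;		/* reference held by the main thread */
	return pr;
}

static void
demo_process_ref(struct demo_process *pr)
{
	pr->refcnt++;
}

/* Drop one reference; returns 1 if this call freed the structure. */
static int
demo_process_rele(struct demo_process *pr)
{
	if (--pr->refcnt > 0)
		return 0;
	free(pr);
	return 1;
}

/*
 * The main thread is torn down first, a second thread later.
 * Returns (freed by main) * 10 + (freed by other thread).
 */
static int
demo_refcnt_scenario(void)
{
	struct demo_process *pr = demo_process_new();
	int freed_by_main, freed_by_other;

	if (pr == NULL)
		return -1;
	demo_process_ref(pr);			/* a second thread joins */
	freed_by_main = demo_process_rele(pr);	/* main thread goes first */
	freed_by_other = demo_process_rele(pr);	/* last user frees */
	return freed_by_main * 10 + freed_by_other;
}
```

In this sketch the structure survives the main thread's release and is freed only when the last thread lets go, which is the property the commit message describes.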


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
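
As a rough userland sketch of the TAILQ-based sleep queue mentioned in the list above: with a doubly linked tail queue, removing an entry is O(1), so a wakeup can be done in one pass over the queue. The names `demo_proc`, `demo_slpque` and `demo_wakeup` are hypothetical and are not the kernel's code; only the `<sys/queue.h>` macros are real.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeper record; wchan is the channel it sleeps on. */
struct demo_proc {
	const void *wchan;
	int awake;
	TAILQ_ENTRY(demo_proc) runq;
};
TAILQ_HEAD(demo_slpque, demo_proc);

/* Wake every sleeper on chan in a single pass; returns how many woke. */
static int
demo_wakeup(struct demo_slpque *q, const void *chan)
{
	struct demo_proc *p, *next;
	int n = 0;

	for (p = TAILQ_FIRST(q); p != NULL; p = next) {
		next = TAILQ_NEXT(p, runq);	/* fetch before unlinking p */
		if (p->wchan != chan)
			continue;
		TAILQ_REMOVE(q, p, runq);	/* O(1) unlink */
		p->awake = 1;
		n++;
	}
	return n;
}

/* Returns 1 if exactly the two sleepers on the woken channel were woken. */
static int
demo_wakeup_scenario(void)
{
	struct demo_slpque q = TAILQ_HEAD_INITIALIZER(q);
	struct demo_proc a = { .wchan = &q };
	struct demo_proc b = { .wchan = &a };	/* different channel */
	struct demo_proc c = { .wchan = &q };
	int woken;

	TAILQ_INSERT_TAIL(&q, &a, runq);
	TAILQ_INSERT_TAIL(&q, &b, runq);
	TAILQ_INSERT_TAIL(&q, &c, runq);

	woken = demo_wakeup(&q, &q);
	return woken == 2 && a.awake && !b.awake && c.awake &&
	    TAILQ_FIRST(&q) == &b && TAILQ_NEXT(&b, runq) == NULL;
}
```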


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
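
A userland analogue of the atomic bit operations described above, using C11 `<stdatomic.h>`. This is a sketch only: the kernel's atomic_{set,clear}bits_int are not reproduced here, and the `DEMO_*` flags and `demo_*` helpers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag values, for illustration only. */
#define DEMO_TRACED	0x0001u
#define DEMO_SUSPEND	0x0002u

/* Atomically set bits, so concurrent updates of other bits are not lost. */
static void
demo_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits. */
static void
demo_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}

/* Set two flags, clear one, and return what is left. */
static unsigned int
demo_flag_roundtrip(void)
{
	atomic_uint flags = 0;

	demo_setbits(&flags, DEMO_TRACED | DEMO_SUSPEND);
	demo_clearbits(&flags, DEMO_TRACED);
	return atomic_load(&flags);
}
```

A plain `flags |= bits` is a read-modify-write that can lose an update made between the read and the write, which is presumably the hazard the commit avoids for a field touched from interrupts.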


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.186 13-Mar-2020 mpi

Rename "sigacts" flag field to avoid conflict with the "process" one.

This shows that atomic_* operations should not be necessary to write
to this field unlike with the process one.

The advantage of using a somewhat-unique prefix for struct member is
moot when multiple definitions use the same prefix :o)

From Amit Kulkarni, ok claudio@


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
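
The WNOHANG behaviour described in 1.143 can be checked from userland. The following is a minimal sketch, not code from the tree: the helper name poll_running_child() and the sentinel value are made up for illustration, and it uses waitpid(2) rather than wait4(2), so the rusage half of the fix is not exercised.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* arbitrary marker, illustration only */

/*
 * Fork a child that just sleeps, poll it once with WNOHANG, then clean up.
 * Returns what waitpid() returned for the poll; *untouched is set to 1 if
 * the status word still holds the sentinel afterwards.
 */
static pid_t
poll_running_child(int *untouched)
{
	int status = STATUS_SENTINEL;
	pid_t child, ret;

	child = fork();
	assert(child != -1);
	if (child == 0) {
		for (;;)
			pause();	/* stay alive until killed */
	}

	/* The child has not terminated, so this should report "nothing yet". */
	ret = waitpid(child, &status, WNOHANG);
	*untouched = (status == STATUS_SENTINEL);

	/* Reap the child so the sketch leaves no zombie behind. */
	kill(child, SIGKILL);
	(void)waitpid(child, NULL, 0);
	return ret;
}
```

A caller would expect a return value of 0 and *untouched set to 1 while the child is still running.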


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
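
The ordering described in 1.121 (unpublish the pointer first, release afterwards) can be sketched in plain C. This is an illustration under assumptions: struct obj, struct owner, obj_release() and owner_drop_ref() are hypothetical stand-ins, not the kernel's vnode code, and obj_release() only imitates a release routine that might sleep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for a vnode. */
struct obj {
	int refcnt;
};

/* Hypothetical owner holding a reachable pointer to the object. */
struct owner {
	struct obj *o_ref;
};

static int releases;	/* counts release calls, for demonstration only */

static struct obj *
obj_new(void)
{
	struct obj *o = malloc(sizeof(*o));

	assert(o != NULL);
	o->refcnt = 1;
	return o;
}

/* Stand-in for a release routine that may sleep when the count hits zero. */
static void
obj_release(struct obj *o)
{
	releases++;
	if (--o->refcnt == 0)
		free(o);
}

/*
 * Drop the owner's reference. The reachable pointer is cleared before the
 * release, so nothing that runs while the release sleeps can find a pointer
 * to an object that is about to be freed.
 */
static void
owner_drop_ref(struct owner *ow)
{
	struct obj *o = ow->o_ref;

	if (o == NULL)
		return;
	ow->o_ref = NULL;	/* erase the reachable pointer first */
	obj_release(o);		/* then release; sleeping is now harmless */
}
```

Calling owner_drop_ref() twice on the same owner releases the object only once, because the second call sees the cleared pointer.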


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
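
The idea behind 1.63 (flag words touched from interrupt context need atomic read-modify-write updates) can be illustrated with C11 atomics. This is a sketch under assumptions, not the kernel's implementation: it uses <stdatomic.h> as a portable stand-in for atomic_setbits_int/atomic_clearbits_int, and the flag names and values below are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, chosen only for this illustration. */
#define SK_FLAG_A	0x01u
#define SK_FLAG_B	0x02u

/* Set bits with a single atomic read-modify-write (stand-in for setbits). */
static void
sketch_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear bits with a single atomic read-modify-write (stand-in for clearbits). */
static void
sketch_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}

/* Read the current flag word. */
static unsigned int
sketch_getflags(atomic_uint *flags)
{
	return atomic_load(flags);
}
```

A plain `flags |= bit` compiles to a separate load, OR, and store, so a concurrent update between the load and the store can be lost; the atomic forms avoid that.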


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
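
The clipping rule from the "nice bug" paragraph above can be sketched as
follows. This is only an illustration, not the committed code: the helper
name clip_estcpu() is hypothetical, NICE_WEIGHT and PRIO_MAX are taken from
the wording of the log ("nice weight of two", "nice +20"), and the PPQ value
is an assumption.

```c
/* Illustrative values only; the real kernel headers define these. */
#define NICE_WEIGHT	2	/* "current nice weight of two" in the log */
#define PRIO_MAX	20	/* largest nice value, as in "nice +20" */
#define PPQ		4	/* assumed number of priorities per run queue */

/* Upper bound for p_estcpu quoted in the log message. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/*
 * Hypothetical helper: clip a CPU usage estimate so that a busy
 * nice +0 process cannot penalize itself onto the queue used by
 * nice +20 processes.
 */
static unsigned int
clip_estcpu(unsigned int estcpu)
{
	return (estcpu > ESTCPU_LIMIT) ? ESTCPU_LIMIT : estcpu;
}
```

With these assumed constants the limit is 36, so a value that previously
survived the no-op clip to UCHAR_MAX is now bounded well below it.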


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.185 01-Mar-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD & FreeBSD.

Diagnosed with help from espie@ & guenther@.

ok claudio@, visa@


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
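
A minimal sketch of what "convert those to a consolidated status" can look
like. The helper name is hypothetical and the bit layout (exit code in bits
8-15, signal number in the low 7 bits) is the traditional wait(2) encoding,
assumed here rather than taken from the commit itself.

```c
/*
 * Hypothetical helper: build a wait(2)-style status word from an
 * exit code and a terminating signal that are stored separately.
 * Assumed layout: exit code in bits 8-15, signal in the low 7 bits.
 */
static int
consolidated_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

Keeping the two values apart until a caller needs the combined word means
the exit path never has to decode a packed status.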


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
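
The ordering described above (unhook every reachable pointer first, then
release) can be sketched in isolation. The types and functions here are
hypothetical stand-ins, not the vnode API; object_release() merely plays the
role of a release routine that might sleep.

```c
#include <stddef.h>

struct object {
	int refcnt;
};

struct owner {
	struct object *obj;	/* pointer other code can reach */
};

static int releases;		/* counts calls, for illustration only */

/* Stand-in for a release routine that may sleep before finishing. */
static void
object_release(struct object *o)
{
	o->refcnt--;
	releases++;
}

/*
 * Clear the reachable pointer before releasing, so anyone who runs
 * while the release sleeps sees NULL rather than a dying object.
 */
static void
owner_drop(struct owner *ow)
{
	struct object *o = ow->obj;

	if (o == NULL)
		return;
	ow->obj = NULL;
	object_release(o);
}
```

Calling owner_drop() twice is harmless in this sketch because the second
call finds the pointer already cleared.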


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and dealing with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
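
The encoding trick described in 1.47 can be illustrated with a small
standalone sketch. The X_-prefixed names are hypothetical, and the octal
values are assumptions modelled on the traditional BSD <sys/wait.h> layout
(low 7 bits carry the status code, 0177 marks a stopped process); they are
not copied from this revision.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for the <sys/wait.h> macros. The values are
 * assumptions chosen so that the "continued" status shares its low
 * 7 bits with the "stopped" marker, as the commit message describes.
 */
#define X_WSTATUS(x)      ((x) & 0177)
#define X_WSTOPPED        0177
#define X_WCONTINUED      0177777
#define X_WIFSTOPPED(x)   (((x) & 0xff) == X_WSTOPPED)
#define X_WIFSIGNALED(x)  (X_WSTATUS(x) != X_WSTOPPED && X_WSTATUS(x) != 0)
#define X_WIFCONTINUED(x) (((x) & X_WCONTINUED) == X_WCONTINUED)

/*
 * Classify a wait status word:
 * 'C' continued, 'S' stopped, 'K' killed by a signal, 'E' exited.
 */
char
x_classify(int status)
{
	if (X_WIFCONTINUED(status))
		return 'C';
	if (X_WIFSTOPPED(status))
		return 'S';
	if (X_WIFSIGNALED(status))
		return 'K';
	return 'E';
}

/* Check the property the commit relies on. */
void
x_selfcheck(void)
{
	/* The low 7 bits of the continued value equal the stopped marker... */
	assert(X_WSTATUS(X_WCONTINUED) == X_WSTOPPED);
	/* ...so the signaled test needs no extra case, and comparing the */
	/* low 8 bits keeps the stopped test from matching as well. */
	assert(!X_WIFSIGNALED(X_WCONTINUED));
	assert(!X_WIFSTOPPED(X_WCONTINUED));
}
```

Under these assumed values the stopped test compares eight bits rather than
seven, which is what keeps a continued status from also reading as stopped.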


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.184 28-Feb-2020 mpi

Revert previous, diff contains an obvious bug.


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
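
The WNOHANG behaviour described in 1.143 can be observed from userland with
a sketch along these lines. The helper name probe_wnohang(), the result
struct and the sentinel value are hypothetical; the sketch assumes a system
that provides wait4(2) and a child that is still running when it is polled.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical result record for the probe below. */
struct probe_result {
	pid_t	ret;			/* return value of the WNOHANG call */
	int	status_untouched;	/* 1 if status kept its sentinel */
};

/*
 * Fork a child that stays alive, poll it once with WNOHANG, then kill
 * and reap it. Returns 0 if the child was reaped cleanly, -1 otherwise.
 */
int
probe_wnohang(struct probe_result *out)
{
	const int sentinel = 0x5a5a5a5a;	/* arbitrary marker value */
	int status = sentinel;
	struct rusage ru;
	pid_t pid, reaped;

	assert(out != NULL);

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: stay alive long enough to be polled. */
		sleep(30);
		_exit(0);
	}

	/* The child is still running, so WNOHANG should yield 0. */
	out->ret = wait4(pid, &status, WNOHANG, &ru);
	out->status_untouched = (status == sentinel);

	/* Clean up: terminate the child and collect it for real. */
	if (kill(pid, SIGKILL) == -1)
		return -1;
	reaped = wait4(pid, &status, 0, NULL);
	return reaped == pid ? 0 : -1;
}
```

On a kernel with the behaviour this commit describes, ret is 0 and
status_untouched is expected to be 1; the contents of ru after a zero
return are not something a caller should rely on.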


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
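
Illustration of the ordering described above, as a minimal userland sketch rather than the kernel code. All names here (struct obj, struct holder, obj_release(), holder_drop()) are hypothetical stand-ins: obj plays the role of the vnode and obj_release() the role of vrele().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for a vnode. */
struct obj {
	int refcnt;
};

/* Hypothetical structure that holds a reachable pointer to the object. */
struct holder {
	struct obj *o;
};

/*
 * Stand-in for vrele(): drop one reference, free on the last one.
 * The real function may sleep here, which is why ordering matters.
 */
static void
obj_release(struct obj *o)
{
	if (--o->refcnt == 0)
		free(o);
}

/*
 * Erase the reachable pointer first, then release through a local
 * copy, so nothing can find the object via the holder while the
 * release is in progress.
 */
static void
holder_drop(struct holder *h)
{
	struct obj *o = h->o;

	if (o == NULL)
		return;
	h->o = NULL;
	obj_release(o);
}
```

Calling holder_drop() a second time is harmless in this sketch, because the pointer was already cleared before the release.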


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
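
Illustration of the reference-count idea described above, as a minimal single-threaded userland sketch. The names (struct fake_process, process_new(), process_hold(), process_rele()) and the freed_count counter are hypothetical and exist only for this example; real kernel code would also need locking or atomic operations.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct process. */
struct fake_process {
	int refcnt;	/* one reference per thread still using it */
};

static int freed_count;	/* instrumentation for this example only */

/* Create the structure with the main thread's reference. */
static struct fake_process *
process_new(void)
{
	struct fake_process *pr = malloc(sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = 1;
	return pr;
}

/* Another thread takes a reference. */
static void
process_hold(struct fake_process *pr)
{
	pr->refcnt++;
}

/*
 * Drop a reference. Only the last caller frees the structure, so
 * it does not matter which thread reaches this point first.
 * Returns 1 if this call freed the structure, 0 otherwise.
 */
static int
process_rele(struct fake_process *pr)
{
	if (--pr->refcnt > 0)
		return 0;
	free(pr);
	freed_count++;
	return 1;
}
```

With two references held, the first process_rele() leaves the structure alive and only the second one frees it, regardless of which holder goes first.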


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
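
Illustration of atomic bit operations on a flag word, as a userland sketch. The commit names the kernel primitives atomic_{set,clear}bits_int; this example swaps in C11 <stdatomic.h> as an analogue and does not show the kernel implementation. The flag values and the helper names setbits() and clearbits() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define FLAG_A	0x01u
#define FLAG_B	0x02u

/* Atomically OR bits into the flag word (read-modify-write as one step). */
static void
setbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Atomically clear bits from the flag word. */
static void
clearbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}
```

A plain `flags |= bit` is a separate load, modify, and store, so an update made in between by an interrupt or another CPU can be lost; the atomic forms avoid that.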


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
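
Illustration of the clipping point made under "nice bug" above, as a small sketch rather than the scheduler code. NICE_WEIGHT of two comes from the message; the values PRIO_MAX = 20 and PPQ = 4 are assumptions for this example, and the function names clip_old() and clip_new() are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define NICE_WEIGHT		2	/* "nice weight of two" from the message */
#define PRIO_MAX_ASSUMED	20	/* assumption: usual PRIO_MAX value */
#define PPQ_ASSUMED		4	/* assumption: priorities per run queue */

/* Upper bound named in the message: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX_ASSUMED - PPQ_ASSUMED)

/*
 * Old behaviour: clip an unsigned char to UCHAR_MAX. The comparison
 * can never be true, so the value always passes through unchanged.
 */
static unsigned char
clip_old(unsigned char estcpu)
{
	return estcpu > UCHAR_MAX ? UCHAR_MAX : estcpu;
}

/* Behaviour the message asks for: clip to the limit above. */
static unsigned char
clip_new(unsigned char estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}
```

Under these assumed constants the limit works out to 2 * 20 - 4 = 36, so a large p_estcpu value is left untouched by the old clip and reduced to 36 by the new one.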


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.183 12-Feb-2020 mpi

Do not reparent a traced child to ourself inside wait(2).

When a traced process _exit(2)s, its (tracing) parent tries to give it
back to the old parent. In the case where the old parent is the same
as the tracing parent, there's no need to do this dance, so simply
remove it from the list of zombies and free its descriptors.

Fix a double report via wait(2) exposed by recent changes in make and
newly imported ptrace(2) regression from NetBSD.

Diagnosed with espie@ and guenther@, ok claudio@


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
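
Illustration of consolidating a separate exit code and signal into one wait status, as a userland sketch. The helper name consolidate_status() is hypothetical, and the bit layout (exit code in bits 8-15, signal in the low 7 bits) is the traditional encoding, assumed here rather than taken from the commit; BSD <sys/wait.h> offers a W_EXITCODE macro for the same purpose.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Combine an exit code and a terminating signal into a single
 * wait(2)-style status word, assuming the traditional layout.
 */
static int
consolidate_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

On a platform that uses this layout, the standard macros decode the result: a status built from exit code 3 and no signal satisfies WIFEXITED with WEXITSTATUS of 3, and one built from SIGKILL satisfies WIFSIGNALED with WTERMSIG of SIGKILL.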


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
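The ordering described above can be sketched as follows. This is an illustrative userland model only: the struct and function names are hypothetical, and release_sketch() merely stands in for a release routine that may sleep when the count reaches zero.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for a vnode and its owner. */
struct vnode_sketch {
	int refcnt;
};

struct process_sketch {
	struct vnode_sketch *textvp;
};

/* Stand-in for a release routine that may block on the last reference. */
void
release_sketch(struct vnode_sketch *vp)
{
	if (--vp->refcnt == 0)
		free(vp);
}

/*
 * Erase the reachable pointer first, then release through a local copy.
 * If the release sleeps, nobody else can find the vnode via the owner.
 */
void
drop_text_vnode(struct process_sketch *pr)
{
	struct vnode_sketch *vp = pr->textvp;

	pr->textvp = NULL;
	if (vp != NULL)
		release_sketch(vp);
}
```

Calling drop_text_vnode() twice is harmless in this sketch, because the second call sees a NULL pointer and does nothing.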


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
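As a rough userland analogy for the atomic flag updates described above, the sketch below uses C11 <stdatomic.h> rather than the kernel's own atomic.h. The helper names and flag values are hypothetical and chosen only for illustration:

```c
#include <stdatomic.h>

/* Hypothetical flag bits, not the kernel's actual values. */
#define F_SKETCH_A	0x0001u
#define F_SKETCH_B	0x0004u

/*
 * Set bits with a single atomic read-modify-write, so a concurrent
 * update from another context cannot be lost between load and store.
 */
void
setbits_sketch(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear bits the same way, by and-ing with the complement. */
void
clearbits_sketch(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain `flags |= bits` compiles to separate load and store steps on many machines, which is the window an interrupt could hit.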


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE, free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
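
Illustration (not from the commit): a small sketch of the clipping fix
described under "nice bug". The DEMO_* names are hypothetical; the nice
weight of 2 comes from the message, while PRIO_MAX = 20 and PPQ = 4 are
assumed values. It contrasts the old clip to UCHAR_MAX, which cannot change
an unsigned char, with a clip to NICE_WEIGHT * PRIO_MAX - PPQ.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_NICE_WEIGHT	2	/* "nice weight of two" per the message */
#define DEMO_PRIO_MAX		20	/* assumed largest nice value */
#define DEMO_PPQ		4	/* assumed priorities per run queue */
#define DEMO_ESTCPU_LIMIT	(DEMO_NICE_WEIGHT * DEMO_PRIO_MAX - DEMO_PPQ)

/* Old behaviour: an unsigned char never exceeds UCHAR_MAX, so this is a no-op. */
static unsigned char
demo_clip_old(unsigned char estcpu)
{
	unsigned int v = estcpu;

	return v > UCHAR_MAX ? UCHAR_MAX : (unsigned char)v;
}

/* Fixed behaviour: keep the value at or below the limit named in the message. */
static unsigned char
demo_clip_new(unsigned int estcpu)
{
	if (estcpu > DEMO_ESTCPU_LIMIT)
		return DEMO_ESTCPU_LIMIT;
	return (unsigned char)estcpu;
}
```

Under these assumed constants the limit works out to 2 * 20 - 4 = 36.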


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.182 19-Dec-2019 mpi

Convert infinite sleeps to {m,t}sleep_nsec(9).

ok visa@


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
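
Illustration (not from the commit): a sketch of what "a consolidated status"
could look like, assuming the traditional wait(2) layout with the exit code
in bits 8-15 and the signal number in the low 7 bits. demo_make_status() is
a hypothetical helper, not the kernel's code, and it ignores the core-dump
bit.

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Fold a separately stored exit code and terminating signal
 * into one wait(2)-style status word.
 */
static int
demo_make_status(int xexit, int xsig)
{
	return ((xexit & 0xff) << 8) | (xsig & 0x7f);
}
```

The standard macros from <sys/wait.h> can then decode the result: a status
built from (3, 0) reads as a normal exit with code 3, and one built from
(0, SIGKILL) reads as termination by SIGKILL.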


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
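
Illustration (not from the commit): a minimal userland check of the behaviour
described above, assuming a POSIX system. The helper name and the sentinel
value are hypothetical. With WNOHANG and a child that is still running,
waitpid(2) should return 0 and leave the caller's status word alone.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_SENTINEL	0x5a5a	/* arbitrary marker value */

/*
 * Returns 1 if waitpid(WNOHANG) returned 0 and did not touch status,
 * 0 otherwise, and -1 on error.
 */
static int
nohang_leaves_status(void)
{
	int status = DEMO_SENTINEL, unchanged;
	pid_t pid = fork(), ret;

	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child stays alive until killed */
	}

	ret = waitpid(pid, &status, WNOHANG);
	unchanged = (ret == 0 && status == DEMO_SENTINEL) ? 1 : 0;

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);	/* reap the child */
	return unchanged;
}
```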


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
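
Illustration (not from the commit): one way such a cache could be sketched,
assuming a small fixed-size ring. All demo_* names and the ring size are
hypothetical and do not reflect the actual kernel code. An allocator would
skip any candidate pid for which demo_pid_recent() returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RECENT	4	/* assumed cache size */

/* Ring of recently exited pids; unused slots hold 0, which is never handed out. */
static int demo_recent[DEMO_RECENT];
static size_t demo_next;

/* Record an exited pid, overwriting the oldest entry. */
static void
demo_pid_exited(int pid)
{
	demo_recent[demo_next] = pid;
	demo_next = (demo_next + 1) % DEMO_RECENT;
}

/* True if the pid exited recently and should not be recycled yet. */
static bool
demo_pid_recent(int pid)
{
	for (size_t i = 0; i < DEMO_RECENT; i++)
		if (demo_recent[i] == pid)
			return true;
	return false;
}
```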


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
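
Illustration (not from the commit): a userland sketch of the "unhook first,
release second" pattern described above. The demo_* types and functions are
hypothetical stand-ins; demo_vrele() only mimics the fact that the real
vrele() may sleep when the last reference goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_vnode {
	int refcnt;
};

struct demo_process {
	struct demo_vnode *textvp;
};

/* Counts how many vnodes were actually freed. */
static int demo_released;

/* Stand-in for vrele(): drop one reference, free on the last one. */
static void
demo_vrele(struct demo_vnode *vp)
{
	if (--vp->refcnt == 0) {
		demo_released++;
		free(vp);
	}
}

/*
 * Clear the reachable pointer before releasing, so nobody can find
 * the vnode through the process while the release is in progress.
 */
static void
demo_drop_textvp(struct demo_process *pr)
{
	struct demo_vnode *vp = pr->textvp;

	pr->textvp = NULL;
	if (vp != NULL)
		demo_vrele(vp);
}
```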


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@
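
The pid/tid convention described in 1.82 can be sketched as follows. This is
only an illustration: the x_ names are hypothetical, the offset value is a
placeholder assumption, and the "greater than the offset" test assumes that
ordinary pids always stay below the offset.

```c
/* Placeholder value (assumption); the real constant lives in the kernel headers. */
#define X_THREAD_PID_OFFSET 100000

/* TID that names the original (main) thread of process `pid`. */
int
x_main_thread_tid(int pid)
{
	return pid + X_THREAD_PID_OFFSET;
}

/* Nonzero if `id` designates a single thread rather than a whole process. */
int
x_is_tid(int id)
{
	return id > X_THREAD_PID_OFFSET;
}
```

With this split, a kill-style interface can treat a plain pid as "signal the
process" and pid + offset as "signal the original thread only".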


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
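
A minimal sketch of the reference-count idea from 1.74, assuming a plain
integer counter protected by some outer lock (not shown). The struct and
function names are hypothetical and are not the kernel's own.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct x_process {
	int refcnt;	/* one reference per thread still using the struct */
};

/* Allocate the shared struct with one reference for the main thread. */
struct x_process *
x_process_new(void)
{
	struct x_process *pr = malloc(sizeof(*pr));

	if (pr != NULL)
		pr->refcnt = 1;
	return pr;
}

/* Each additional thread takes its own reference. */
void
x_process_hold(struct x_process *pr)
{
	pr->refcnt++;
}

/*
 * Drop one reference. Only the last thread to let go frees the struct,
 * so the main thread reaching the reaper first is harmless.
 * Returns 1 if the struct was freed, 0 otherwise.
 */
int
x_process_rele(struct x_process *pr)
{
	if (--pr->refcnt > 0)
		return 0;
	free(pr);
	return 1;
}
```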


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
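
A userland analogue of the atomic flag updates described in 1.63, written
with C11 <stdatomic.h> rather than the kernel's atomic_{set,clear}bits_int.
The x_ names and flag values are assumptions made for the example.

```c
#include <stdatomic.h>

/* Hypothetical flag bits. */
#define X_FLAG_A 0x1u
#define X_FLAG_B 0x2u

/* Set bits with a single atomic read-modify-write, so a concurrent update is not lost. */
void
x_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear bits the same way. */
void
x_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain `flags |= bits` compiles to a separate load and store on many
machines, which is exactly the window an interrupt could hit.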


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
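
The bit trick in 1.47 can be checked with a small sketch. The macros carry
an X_ prefix to mark them as illustrative; the octal values are assumptions
chosen to satisfy the relations stated in the message, not a copy of
<sys/wait.h>.

```c
/* Illustrative values (assumptions): the low 7 bits hold the status code. */
#define X_WSTATUS(x)      ((x) & 0177)
#define X_WSTOPPED        0177		/* X_WSTATUS value for a stopped process */
#define X_WCONTINUED      0177777	/* status word for a continued process */

/* Stopped: compare the full low byte, so 0377 does not match 0177. */
#define X_WIFSTOPPED(x)   (((x) & 0xff) == X_WSTOPPED)
/* Signaled: low 7 bits are neither the "stopped" marker nor zero. */
#define X_WIFSIGNALED(x)  (X_WSTATUS(x) != X_WSTOPPED && X_WSTATUS(x) != 0)
#define X_WIFCONTINUED(x) (((x) & X_WCONTINUED) == X_WCONTINUED)

/* Classify a status word: 'c' continued, 's' stopped, 'k' killed, 'e' exited. */
char
x_classify(int status)
{
	if (X_WIFCONTINUED(status))
		return 'c';
	if (X_WIFSTOPPED(status))
		return 's';
	if (X_WIFSIGNALED(status))
		return 'k';
	return 'e';
}
```

Because the low 7 bits of the continued value equal the stopped marker, the
signaled test already rejects it, and the 8-bit comparison in the stopped
test rejects it too; neither macro needs an extra clause.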


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
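
The clipping rule from the "nice bug" paragraph can be sketched as below.
All constants and names are assumptions for illustration: a nice weight of 2
as stated above, a maximum nice of 20, 4 priorities per queue, and an
unscaled priority of estcpu + nice * weight with the user base priority
left out.

```c
#define X_NICE_WEIGHT 2		/* priority steps per nice level (from the message) */
#define X_PRIO_MAX    20	/* largest nice value (assumption) */
#define X_PPQ         4		/* priorities per run queue (assumption) */
#define X_ESTCPU_MAX  (X_NICE_WEIGHT * X_PRIO_MAX - X_PPQ)

/* Clip the CPU-usage estimate so it cannot outweigh a full nice +20. */
unsigned int
x_clip_estcpu(unsigned int estcpu)
{
	return estcpu > X_ESTCPU_MAX ? X_ESTCPU_MAX : estcpu;
}

/* Unscaled priority relative to the user base priority (assumed form). */
unsigned int
x_priority(unsigned int estcpu, unsigned int nice)
{
	return x_clip_estcpu(estcpu) + nice * X_NICE_WEIGHT;
}

/* Run queue index for a given priority. */
unsigned int
x_queue(unsigned int pri)
{
	return pri / X_PPQ;
}
```

Under these assumptions a compute-bound nice +0 process tops out one queue
short of an idle nice +20 process, which is the separation the message asks
for.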


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.181 11-Dec-2019 guenther

Replace p_xstat with ps_xexit and ps_xsig
Convert those to a consolidated status when needed in wait4(), kevent(),
and sysctl()
Pass exit code and signal separately to exit1()
(This also serves as prep for adding waitid(2))

ok mpi@
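
A sketch of the "consolidated status" conversion mentioned in 1.181,
assuming the conventional wait(2) layout with the exit code in bits 8-15 and
the signal number in the low 7 bits. The struct and its field names are
hypothetical stand-ins for ps_xexit and ps_xsig.

```c
/* Hypothetical holder for the two values the process now stores separately. */
struct x_exitinfo {
	int xexit;	/* exit code passed to exit */
	int xsig;	/* terminating signal, 0 if none */
};

/* Build a wait(2)-style status word only when a caller needs one. */
int
x_exit_status(const struct x_exitinfo *xi)
{
	return ((xi->xexit & 0xff) << 8) | (xi->xsig & 0x7f);
}
```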


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
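
As an illustration of the WNOHANG case described above, here is a minimal
userland sketch. It is not part of this commit: the helper name
poll_running_child() is hypothetical, and the sketch only assumes the
documented wait4(2) contract that a WNOHANG call returns 0 while the child
is still running.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from kern_exit.c): poll a still-running
 * child with WNOHANG.  Returns the wait4() result, which is 0 when no
 * child has changed state, or -1 if fork() or wait4() failed.
 * The child is killed and reaped before returning.
 */
pid_t
poll_running_child(int *status_out)
{
	struct rusage ru;
	pid_t child, ret;

	child = fork();
	if (child == -1)
		return -1;
	if (child == 0) {
		/* Child: block until the parent kills us. */
		for (;;)
			pause();
	}

	/* The child is alive, so this returns immediately with 0. */
	ret = wait4(child, status_out, WNOHANG, &ru);

	/* Clean up: terminate the child and reap it with a blocking wait. */
	kill(child, SIGKILL);
	(void)wait4(child, NULL, 0, NULL);
	return ret;
}
```

A caller can preload *status_out with a sentinel value and compare it
afterwards to see whether the kernel touched it on the 0-return path.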


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)
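
The clipping rule described above can be sketched in a few lines. This is only an illustration, not the committed scheduler code: the nice weight of two comes from the message, while the PRIO_MAX and PPQ values and the helper name estcpu_clip() are assumptions chosen to make the bound concrete.

```c
/*
 * Sketch of the p_estcpu clipping bound, under assumed constants.
 * Only NICE_WEIGHT == 2 is taken from the commit message; PRIO_MAX
 * and PPQ are illustrative values, not quoted from the kernel.
 */
#define NICE_WEIGHT	2
#define PRIO_MAX	20	/* assumed: largest nice value */
#define PPQ		4	/* assumed: priorities per run queue */

#define ESTCPU_MAX	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Hypothetical helper: keep estcpu at or below the documented bound. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return (estcpu > ESTCPU_MAX ? ESTCPU_MAX : estcpu);
}
```

With these assumed constants the bound is 36, so a compute-bound process cannot penalize itself onto the queue used by nice +20 processes.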

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.180 04-Nov-2019 visa

Restore the old way of dispatching dead procs through idle proc.
The new way needs more thought.


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
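
A caller can check the WNOHANG behavior described above with a sketch along these lines. The helper names are hypothetical, and the example only exercises the status argument; it makes no claim about what the kernel writes to rusage.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that blocks until it is killed; returns its pid or -1. */
static pid_t
spawn_blocked_child(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		for (;;)
			pause();
	}
	return pid;
}

/* Non-blocking poll: returns 0 while the child is still running. */
static pid_t
poll_child(pid_t pid, int *status)
{
	return wait4(pid, status, WNOHANG, NULL);
}
```

If the caller fills status with a sentinel before calling poll_child() on a running child, the sentinel should still be there when the call returns 0.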


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
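
The ordering described above (unpublish the pointer, then release) can be sketched with stand-in types. Everything here is hypothetical: struct object stands in for a vnode, object_release() for vrele(), and struct owner for whatever structure holds the reachable pointer.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's vnode or process structures. */
struct object {
	int refcnt;
};

struct owner {
	struct object *obj;	/* pointer other code can reach */
};

/* Stand-in for vrele(): in the kernel this may sleep. */
static void
object_release(struct object *o)
{
	o->refcnt--;
}

/* Clear the reachable pointer first, then drop the reference. */
static void
owner_drop_object(struct owner *ow)
{
	struct object *o = ow->obj;

	ow->obj = NULL;
	if (o != NULL)
		object_release(o);
}
```

Because the pointer is cleared before the release can sleep, nothing that runs in the meantime can find the object through the owner, and a second call is a harmless no-op.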


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
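
The reference-counting idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `struct process_sketch`, `process_ref()` and `process_rele()` are hypothetical names, not the kernel's, and the counter is a plain int with no locking, so the sketch is not safe for concurrent use.

```c
#include <assert.h>

/* Hypothetical stand-in for the state shared by all threads of a process. */
struct process_sketch {
	int refcnt;	/* one reference per thread still attached */
	int freed;	/* set instead of freeing, so the outcome can be inspected */
};

/* Take a reference on behalf of a thread. */
static void
process_ref(struct process_sketch *pr)
{
	pr->refcnt++;
}

/*
 * Drop a reference. Only the caller that drops the last one tears the
 * structure down, so it does not matter which thread gets there first.
 * Returns 1 if this call released the structure, 0 otherwise.
 */
static int
process_rele(struct process_sketch *pr)
{
	assert(pr->refcnt > 0);
	if (--pr->refcnt == 0) {
		pr->freed = 1;
		return 1;
	}
	return 0;
}
```

With two threads holding references, the first `process_rele()` leaves the structure alive and the second one releases it, whichever thread calls it.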


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
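
As a rough illustration of the TAILQ-based sleep queue mentioned in the list above, the sketch below wakes every sleeper on a wait channel in a single pass. It is not the kernel's wakeup(): `struct sleeper`, its fields and `wakeup_sketch()` are hypothetical names, and only the `<sys/queue.h>` macros are real.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical sleeper record; the kernel keeps this state in struct proc. */
struct sleeper {
	const void *wchan;		/* wait channel the sleeper is blocked on */
	int awake;			/* set once the sleeper has been woken */
	TAILQ_ENTRY(sleeper) link;	/* sleep queue linkage */
};
TAILQ_HEAD(sleepq, sleeper);

/* Wake every sleeper on wchan in one pass; returns how many were woken. */
static int
wakeup_sketch(struct sleepq *q, const void *wchan)
{
	struct sleeper *s, *next;
	int n = 0;

	for (s = TAILQ_FIRST(q); s != NULL; s = next) {
		next = TAILQ_NEXT(s, link);	/* saved before s is unlinked */
		if (s->wchan == wchan) {
			TAILQ_REMOVE(q, s, link);	/* constant-time unlink */
			s->awake = 1;
			n++;
		}
	}
	return n;
}
```

Because TAILQ_REMOVE() needs no search, each entry is visited once and the whole pass stays linear in the queue length.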


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
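
The clipping bound from the "nice bug" paragraph can be checked with a little arithmetic. The sketch below is illustrative only: NICE_WEIGHT = 2 comes from the message, while PRIO_MAX = 20, PPQ = 4 (priorities per run queue) and the helper names are assumptions that the log does not state.

```c
#define NICE_WEIGHT	2	/* from the message: "nice weight of two" */
#define PRIO_MAX	20	/* assumed: largest nice value */
#define PPQ		4	/* assumed: priorities per run queue */

/* Upper bound on p_estcpu quoted in the message. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip a CPU-usage estimate to the bound. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}

/* Run-queue index for a given priority offset. */
static unsigned int
queue_of(unsigned int prio_offset)
{
	return prio_offset / PPQ;
}
```

Under these assumed values the bound is 36, which maps to queue 9, one below the queue 10 that the nice weight alone (2 * 20 = 40) gives a nice +20 process. That matches the stated goal that a compute-bound process cannot penalize itself onto the nice +20 queue.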


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.179 02-Nov-2019 visa

Move dead procs to the reaper queue immediately after context switch.
This eliminates a forced context switch to the idle proc. In addition,
sched_exit() no longer needs to sum proc runtime because mi_switch()
will do it.

OK mpi@ a while ago


Revision tags: OPENBSD_6_6_BASE
# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
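
The WNOHANG behaviour described above can be observed from userland with a sketch like the following. It uses only standard POSIX calls; the helper name `poll_running_child()` and the sentinel parameter are inventions for the example.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Poll a still-running child with WNOHANG.
 * Stores waitpid()'s return value in *result and returns the status
 * word as it looks after the call; it starts out as the sentinel.
 */
static int
poll_running_child(int sentinel, pid_t *result)
{
	int status = sentinel;
	pid_t pid = fork();

	if (pid == -1) {
		*result = -1;
		return status;
	}
	if (pid == 0) {
		for (;;)
			pause();	/* child never exits on its own */
	}

	*result = waitpid(pid, &status, WNOHANG);

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);		/* reap the child */
	return status;
}
```

On a system with the behaviour this commit describes, the poll returns 0 and the status word still holds the sentinel value.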


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to log in: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
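
The "small cache of recently exited pids" idea can be sketched as a fixed-size
ring that the allocator consults before handing out a candidate pid. This is
only an illustration under stated assumptions: the x_ names, the ring size of
8, and the treatment of pid 0 as never allocatable are hypothetical and not
taken from the commit.

```c
#include <stddef.h>

#define X_RECENT 8                      /* assumed cache size */

static int x_recent[X_RECENT];          /* zero-filled: pid 0 is never handed out */
static unsigned int x_recent_idx;       /* next slot to overwrite */

/* Remember a pid that has just been released; the oldest entry is evicted. */
static void
x_freepid(int pid)
{
	x_recent[x_recent_idx++ % X_RECENT] = pid;
}

/* Return 1 if the candidate pid exited recently and should not be reused yet. */
static int
x_pid_recently_used(int pid)
{
	size_t i;

	for (i = 0; i < X_RECENT; i++)
		if (x_recent[i] == pid)
			return 1;
	return 0;
}
```

An allocator built on this would pick another candidate whenever
x_pid_recently_used() returns 1, so a pid becomes eligible again only after
X_RECENT later exits have pushed it out of the ring.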


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
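
The ordering described above (unpublish the pointer, then release) can be
sketched as follows. The struct layouts, the x_ names, and the stand-in
release function are hypothetical; the real vrele() and the fields touched by
this commit are not reproduced here.

```c
#include <stddef.h>

struct x_vnode { int refcnt; };
struct x_process { struct x_vnode *textvp; };

static int x_release_calls;             /* counts calls, for illustration only */

/* Stand-in for a release routine that may sleep when the count hits zero. */
static void
x_vrele(struct x_vnode *vp)
{
	x_release_calls++;
	if (vp->refcnt > 0)
		vp->refcnt--;
}

/* Clear the reachable pointer first, then drop the reference. */
static void
x_drop_textvp(struct x_process *pr)
{
	struct x_vnode *vp = pr->textvp;

	if (vp == NULL)
		return;
	pr->textvp = NULL;      /* nobody can find vp through pr any more */
	x_vrele(vp);            /* may sleep; pr no longer points at vp */
}
```

Because the field is cleared before the potentially sleeping call, anything
that runs while the release is blocked sees NULL rather than a vnode that is
halfway through being freed, and a second call is a harmless no-op.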


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
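
The relationship between the values can be checked with a small C sketch. The
X_-prefixed macros and their octal values are assumptions chosen to satisfy
the properties stated in this message; they are not copied from <sys/wait.h>.

```c
/* Hypothetical stand-ins; the values are assumptions for illustration. */
#define X_WSTATUS(x)      ((x) & 0177)  /* low seven bits of the status */
#define X_WSTOPPED        0177          /* X_WSTATUS value of a stopped process */
#define X_WCONTINUED      0177777       /* status reported for a continued process */

/* Compare the low eight bits so a continued status is not seen as stopped. */
#define X_WIFSTOPPED(x)   (((x) & 0xff) == X_WSTOPPED)
#define X_WIFCONTINUED(x) (((x) & X_WCONTINUED) == X_WCONTINUED)

/* Same shape as before WCONTINUED existed: no extra check is needed. */
#define X_WIFSIGNALED(x)  (X_WSTATUS(x) != X_WSTOPPED && X_WSTATUS(x) != 0)
#define X_WIFEXITED(x)    (X_WSTATUS(x) == 0)

/* Build a "stopped by signal sig" status: signal in bits 8-15, 0177 below. */
static int
x_w_stopcode(int sig)
{
	return (sig << 8) | X_WSTOPPED;
}
```

With these values a continued status has the same low seven bits as a stopped
one, so the signaled and exited tests reject it without modification, while
the eight-bit comparison in the stopped test tells the two apart.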


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
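
The clipping rule from the "nice bug" paragraph can be illustrated with a
short C sketch. The nice weight of two comes from the message; PRIO_MAX = 20,
PPQ = 4 (priorities per queue), the X_ names, and the simple sum used for the
priority offset are assumptions made for this example.

```c
#define X_NICE_WEIGHT   2       /* from the message: nice weight of two */
#define X_PRIO_MAX      20      /* assumed maximum nice value */
#define X_PPQ           4       /* assumed priorities per run queue */
#define X_ESTCPU_LIMIT  (X_NICE_WEIGHT * X_PRIO_MAX - X_PPQ)

/* Clip the decayed CPU estimate so it cannot reach the nice +20 queue. */
static unsigned int
x_estcpu_clip(unsigned int estcpu)
{
	return estcpu < X_ESTCPU_LIMIT ? estcpu : X_ESTCPU_LIMIT;
}

/* Offset into the user priority range from CPU history plus nice (0..20). */
static unsigned int
x_user_pri_offset(unsigned int estcpu, unsigned int nice)
{
	return x_estcpu_clip(estcpu) + X_NICE_WEIGHT * nice;
}
```

Under these assumptions a nice +0 process tops out at an offset of 36, which
maps to queue 9, while an idle nice +20 process starts at 40, which maps to
queue 10, so a compute-bound process can no longer penalize itself onto the
nice +20 queue.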


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.178 21-Jun-2019 visa

Make resource limit access MP-safe. So far, the copy-on-write sharing
of resource limit structs has been done between processes. By applying
copy-on-write also between threads, threads can read rlimits in
a nearly lock-free manner.

Inspired by code in DragonFly BSD and FreeBSD.

OK mpi@, agreement from jmatthew@ and anton@


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
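
The WNOHANG behaviour described above can be checked from userland. The following is a minimal sketch, assuming a POSIX system where waitpid(2) reaches the same kernel path as wait4(2); the helper name wnohang_leaves_status_alone() and the sentinel value are hypothetical and not part of the commit. It only checks the status word, not rusage.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* arbitrary marker, not a real status */

/*
 * Poll a still-running child with WNOHANG and report whether the
 * status word was left alone.  Returns 1 if waitpid(2) returned 0 and
 * the sentinel survived, 0 if not, -1 on error.
 */
int
wnohang_leaves_status_alone(void)
{
	int status = STATUS_SENTINEL, reaped;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: stay alive until killed */
	}

	/* The child has not changed state, so this should return 0. */
	ret = waitpid(pid, &status, WNOHANG);

	/* Clean up: kill and reap the child using a separate status word. */
	kill(pid, SIGKILL);
	if (waitpid(pid, &reaped, 0) != pid)
		return -1;

	return (ret == 0 && status == STATUS_SENTINEL);
}
```

A caller that pre-initializes status can therefore tell "nothing to report" apart from a real exit status by looking at the return value alone.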


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
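
The "small cache of recently exited pids" can be pictured as a fixed-size ring that the allocator consults before handing out a pid. The sketch below is a userland model under stated assumptions: the names toy_freepid() and toy_recently_exited() and the cache size are hypothetical and are not taken from the kernel source.

```c
#include <stddef.h>

#define TOY_NCACHED	4	/* hypothetical cache size */

/* Zero-initialized; pid 0 is never handed out, so empty slots are harmless. */
static int toy_oldpids[TOY_NCACHED];
static size_t toy_next;

/* Remember a pid that just exited, overwriting the oldest entry. */
void
toy_freepid(int pid)
{
	toy_oldpids[toy_next] = pid;
	toy_next = (toy_next + 1) % TOY_NCACHED;
}

/* Return 1 if pid exited recently and should not be handed out yet. */
int
toy_recently_exited(int pid)
{
	size_t i;

	for (i = 0; i < TOY_NCACHED; i++) {
		if (toy_oldpids[i] == pid)
			return 1;
	}
	return 0;
}
```

A pid becomes eligible for reuse again only after TOY_NCACHED later exits have pushed it out of the ring.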


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
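
The ordering described above (erase the reachable pointer, then release) can be sketched in userland C. Everything here is a hypothetical stand-in: toy_vnode, toy_owner and toy_release() are illustration names, and toy_release() only simulates a routine that might sleep by checking what an observer would see at that moment.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a vnode and the structure pointing at it. */
struct toy_vnode {
	int	refcnt;
};

struct toy_owner {
	struct toy_vnode	*textvp;
};

static struct toy_owner *toy_observed_owner;
static int toy_pointer_visible_during_release;

/*
 * Stand-in for a release routine that may sleep.  While it "sleeps",
 * other code could look at the owner, so record whether the owner
 * still exposes the vnode at that moment.
 */
static void
toy_release(struct toy_vnode *vp)
{
	if (toy_observed_owner != NULL && toy_observed_owner->textvp == vp)
		toy_pointer_visible_during_release = 1;
	vp->refcnt--;
}

/*
 * Clear the reachable pointer first, then release through a local copy.
 * Returns 1 if the pointer was still visible during the release.
 */
int
toy_drop_textvp(struct toy_owner *o)
{
	struct toy_vnode *vp = o->textvp;

	toy_observed_owner = o;
	toy_pointer_visible_during_release = 0;

	o->textvp = NULL;
	if (vp != NULL)
		toy_release(vp);

	return toy_pointer_visible_during_release;
}
```

Swapping the two steps in toy_drop_textvp() would make the function return 1, which is the window the commit closes.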


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
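
The idea of counting references so that the last thread out frees the shared structure can be modelled in a few lines of userland C. This is a sketch only: toy_process, toy_process_new() and toy_process_rele() are hypothetical names, the model is single-threaded, and no locking is shown.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct process; only the count matters. */
struct toy_process {
	int	refcnt;		/* one reference per thread still attached */
	int	*freed_flag;	/* set when the structure is released */
};

struct toy_process *
toy_process_new(int nthreads, int *freed_flag)
{
	struct toy_process *pr;

	pr = malloc(sizeof(*pr));
	if (pr == NULL)
		return NULL;
	pr->refcnt = nthreads;
	pr->freed_flag = freed_flag;
	*freed_flag = 0;
	return pr;
}

/*
 * Drop one thread's reference.  Only the last thread out frees the
 * structure, regardless of whether it is the main thread.
 * Returns 1 if this call freed it.
 */
int
toy_process_rele(struct toy_process *pr)
{
	if (--pr->refcnt > 0)
		return 0;
	*pr->freed_flag = 1;
	free(pr);
	return 1;
}
```

With three references, the first two calls to toy_process_rele() leave the structure alive, so a thread that is reaped late still finds valid memory.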


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
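
For readers unfamiliar with the pattern, the mechanical change from plain bit operations to atomic ones looks roughly like the following. This is a userland analogue built on C11 <stdatomic.h>, not the kernel's atomic.h; the toy_* names and flag values are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical flag values; the real P_* constants live in sys/proc.h. */
#define TOY_FLAG_A	0x01u
#define TOY_FLAG_B	0x02u

/* Userland analogue of atomic_setbits_int(): OR bits in atomically. */
static inline void
toy_setbits(atomic_uint *flagp, unsigned int bits)
{
	atomic_fetch_or(flagp, bits);
}

/* Userland analogue of atomic_clearbits_int(): mask bits out atomically. */
static inline void
toy_clearbits(atomic_uint *flagp, unsigned int bits)
{
	atomic_fetch_and(flagp, ~bits);
}

/* Set two flags, clear one, and return what is left. */
unsigned int
toy_flag_demo(void)
{
	atomic_uint flag = 0;

	toy_setbits(&flag, TOY_FLAG_A | TOY_FLAG_B);
	toy_clearbits(&flag, TOY_FLAG_A);
	return atomic_load(&flag);
}
```

The point of the read-modify-write primitives is that a concurrent update to a different bit of the same word cannot be lost, which a plain `flag |= bits` does not guarantee.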


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.
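
As an illustration of what "list and queue macros" buys in readability, here is a small userland sketch using <sys/queue.h>. The element type toy_proc and the helper names are hypothetical; the kernel lists link struct proc instead.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical element type standing in for struct proc. */
struct toy_proc {
	int			tp_pid;
	LIST_ENTRY(toy_proc)	tp_list;
};

LIST_HEAD(toy_proclist, toy_proc);

/* Walk the list with LIST_FOREACH rather than chasing pointers by hand. */
struct toy_proc *
toy_find(struct toy_proclist *head, int pid)
{
	struct toy_proc *p;

	LIST_FOREACH(p, head, tp_list) {
		if (p->tp_pid == pid)
			return p;
	}
	return NULL;
}

/*
 * Build a three-element list, remove one entry, and look it up again.
 * Returns 1 if the list behaves as expected, 0 or -1 otherwise.
 */
int
toy_list_demo(void)
{
	struct toy_proclist head;
	struct toy_proc a = { .tp_pid = 1 }, b = { .tp_pid = 2 },
	    c = { .tp_pid = 3 };

	LIST_INIT(&head);
	LIST_INSERT_HEAD(&head, &a, tp_list);
	LIST_INSERT_HEAD(&head, &b, tp_list);
	LIST_INSERT_HEAD(&head, &c, tp_list);

	if (toy_find(&head, 2) != &b)
		return -1;
	LIST_REMOVE(&b, tp_list);
	return (toy_find(&head, 2) == NULL && toy_find(&head, 1) == &a &&
	    toy_find(&head, 3) == &c);
}
```

Because the macros expand to the same pointer manipulation one would write by hand, a conversion like this is expected to leave the generated code unchanged, which matches the commit's note.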


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
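
The userland view of WCONTINUED in revision 1.47 can be sketched as follows. This is a minimal illustration, not code from the tree: the helper name observe_stop_then_continue() is hypothetical, and it only assumes a POSIX system that provides WUNTRACED, WCONTINUED and WIFCONTINUED as described in 1003.1-2001.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* expose WCONTINUED / WIFCONTINUED */
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: stop a child, then continue it.
 * Returns 1 if the parent observed WIFSTOPPED followed by WIFCONTINUED,
 * 0 if either report was missing, and -1 if fork() failed.
 */
int
observe_stop_then_continue(void)
{
	int status, saw_stop = 0, saw_cont = 0;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: idle until killed */
	}

	kill(pid, SIGSTOP);
	if (waitpid(pid, &status, WUNTRACED) == pid && WIFSTOPPED(status))
		saw_stop = 1;

	kill(pid, SIGCONT);
	if (waitpid(pid, &status, WCONTINUED) == pid && WIFCONTINUED(status))
		saw_cont = 1;

	kill(pid, SIGKILL);		/* clean up and reap the child */
	waitpid(pid, &status, 0);

	return saw_stop && saw_cont;
}
```

Without WCONTINUED in the options, the second waitpid() would not report the resume at all, which is the gap this revision closes.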


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
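
The clipping rule from the "nice bug" paragraph of revision 1.20 can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the values of PUSER, PPQ and PRIO_MAX, the helper names, and the simplified priority formula (base + unscaled p_estcpu + weighted nice). Only the bound NICE_WEIGHT * PRIO_MAX - PPQ and the nice weight of two come from the commit message.

```c
/* All constants below are assumptions chosen for illustration. */
#define PUSER		50	/* base user priority */
#define PPQ		4	/* priorities per run queue */
#define NICE_WEIGHT	2	/* priority points per nice step */
#define PRIO_MAX	20	/* largest nice value */

/* Upper bound on the CPU-usage estimate, as stated in the log. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the usage estimate so it cannot outweigh a full nice +20. */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}

/* Simplified priority: larger numbers mean worse priority. */
unsigned int
user_priority(unsigned int estcpu, unsigned int nice)
{
	return PUSER + clip_estcpu(estcpu) + NICE_WEIGHT * nice;
}

/* Map a priority onto its run queue index. */
unsigned int
run_queue(unsigned int pri)
{
	return pri / PPQ;
}
```

Under these assumed numbers, a compute-bound process at nice 0 tops out at priority 86 (queue 21), while an idle nice +20 process sits at 90 (queue 22), so the former can no longer penalize itself onto the latter's queue.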


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
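
From userland, the effect of SA_NOCLDWAIT described in revision 1.9 looks roughly like the sketch below. The helper name nocldwait_prevents_zombie() is hypothetical, and the code assumes only the POSIX behavior that waiting on a child of a SA_NOCLDWAIT process blocks until the child is gone and then fails with ECHILD; it says nothing about how the kernel reparents the child internally.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* expose SA_NOCLDWAIT */
#endif
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if waitpid() fails with ECHILD after a
 * child of a SA_NOCLDWAIT process exits, 0 otherwise, -1 on setup failure.
 */
int
nocldwait_prevents_zombie(void)
{
	struct sigaction sa, old;
	int status, ok;
	pid_t pid;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT;	/* children must not become zombies */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return -1;

	pid = fork();
	if (pid == 0)
		_exit(0);		/* child: exit immediately */

	/* Blocks until the child is gone, then finds nothing to reap. */
	ok = (pid > 0 && waitpid(pid, &status, 0) == -1 && errno == ECHILD);

	sigaction(SIGCHLD, &old, NULL);	/* restore the previous disposition */
	return ok;
}
```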


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.177 13-Jun-2019 mpi

Use PWAIT instead of PUSER in exit1().

When the main thread of a MT process dies, it doesn't matter at which
priority it gets awakened to do the last cleanups. Not using PUSER makes
it easier to understand the existing scheduler logic.

ok visa@


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
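
The WNOHANG behavior fixed in revision 1.143 can be checked from userland with a sketch like this one. The helper name and the sentinel value are assumptions for the demonstration; the check passes only on a kernel that, as the commit describes, leaves the status word alone when it returns 0.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* arbitrary marker, assumption for the demo */

/*
 * Hypothetical helper: returns 1 if waitpid(WNOHANG) on a still-running
 * child returned 0 and left the status word untouched, 0 otherwise,
 * and -1 if fork() failed.
 */
int
wnohang_leaves_status_alone(void)
{
	int status = STATUS_SENTINEL, ok;
	pid_t pid = fork(), rc;

	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: stay alive until killed */
	}

	rc = waitpid(pid, &status, WNOHANG);	/* nothing to report yet */
	ok = (rc == 0 && status == STATUS_SENTINEL);

	kill(pid, SIGKILL);			/* clean up and reap */
	waitpid(pid, &status, 0);
	return ok;
}
```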


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@
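
The "set a flag in hardclock(), deliver from userret()" idea in revision 1.128 can be modeled in userland as below. This is a loose analogy, not kernel code: struct thread_sketch, the SK_* flag names and both functions are hypothetical, and counters stand in for actually posting SIGPROF or SIGVTALRM.

```c
#include <stdatomic.h>

#define SK_PROFPEND	0x1u	/* hypothetical: a SIGPROF is owed */
#define SK_ALRMPEND	0x2u	/* hypothetical: a SIGVTALRM is owed */

struct thread_sketch {
	atomic_uint pending;		/* set from the clock interrupt */
	unsigned int sigprof_sent;	/* stand-ins for real signal delivery */
	unsigned int sigvtalrm_sent;
};

void
thread_sketch_init(struct thread_sketch *t)
{
	atomic_init(&t->pending, 0);
	t->sigprof_sent = 0;
	t->sigvtalrm_sent = 0;
}

/* Clock-interrupt side: only record that a signal is owed, take no locks. */
void
tick_expired(struct thread_sketch *t, unsigned int flag)
{
	atomic_fetch_or(&t->pending, flag);
}

/* Return-to-user side: runs in thread context, where signalling is safe. */
void
return_to_user(struct thread_sketch *t)
{
	unsigned int flags = atomic_exchange(&t->pending, 0);

	if (flags & SK_PROFPEND)
		t->sigprof_sent++;
	if (flags & SK_ALRMPEND)
		t->sigvtalrm_sent++;
}
```

One consequence of this design is that several expirations before the next return to user mode coalesce into a single delivery.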


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
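
A cache of recently exited pids, as mentioned in revision 1.125, could look something like the ring buffer below. The structure, its size and every function name are hypothetical; the log does not say how the real cache is organized, so treat this purely as one way to keep fresh pids from being recycled.

```c
#include <stddef.h>

#define RECENT_PIDS	8	/* hypothetical cache size */

struct pid_cache {
	int	slots[RECENT_PIDS];	/* most recently exited pids */
	size_t	next;			/* slot to overwrite next */
};

void
pid_cache_init(struct pid_cache *c)
{
	for (size_t i = 0; i < RECENT_PIDS; i++)
		c->slots[i] = -1;	/* -1 marks an empty slot */
	c->next = 0;
}

/* Record an exited pid, evicting the oldest entry. */
void
pid_cache_remember(struct pid_cache *c, int pid)
{
	c->slots[c->next] = pid;
	c->next = (c->next + 1) % RECENT_PIDS;
}

int
pid_cache_contains(const struct pid_cache *c, int pid)
{
	for (size_t i = 0; i < RECENT_PIDS; i++)
		if (c->slots[i] == pid)
			return 1;
	return 0;
}

/*
 * Pick a pid starting at candidate, skipping recently exited ones.
 * Assumes pid_max > RECENT_PIDS so the loop always terminates.
 */
int
pid_pick(const struct pid_cache *c, int candidate, int pid_max)
{
	while (pid_cache_contains(c, candidate))
		candidate = (candidate % pid_max) + 1;	/* wrap within 1..pid_max */
	return candidate;
}
```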


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
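
The ordering rule from revision 1.121, clear every reachable pointer before releasing the object, can be shown with a userland analogy. The types and functions below are hypothetical stand-ins: object_release() plays the role of a release routine that may block, and nothing here is the actual vnode code.

```c
#include <stddef.h>
#include <stdlib.h>

struct object_sketch {
	int refcnt;			/* number of outstanding references */
};

struct owner_sketch {
	struct object_sketch *obj;	/* pointer others could follow */
};

/* Drop one reference; frees on the last one (in a kernel this may sleep). */
void
object_release(struct object_sketch *o)
{
	if (--o->refcnt == 0)
		free(o);
}

/* Unhook the pointer first, then release through a private copy. */
void
owner_drop(struct owner_sketch *own)
{
	struct object_sketch *o = own->obj;

	if (o == NULL)
		return;
	own->obj = NULL;	/* nobody can reach the object via the owner now */
	object_release(o);	/* safe even if this blocks or frees */
}
```

Doing the two steps in the opposite order would leave a window in which own->obj points at an object that is being, or has been, freed.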


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
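
The reference-counting idea described above can be sketched in plain C.
This is only an illustration under stated assumptions: the struct name,
its fields, and both helper functions are hypothetical and are not the
kernel's actual definitions, and the "freed" marker stands in for the
real release of the structure.

```c
/* Hypothetical stand-in for struct process; field names are illustrative. */
struct process_sketch {
	int refcnt;	/* one reference per thread still attached */
	int freed;	/* set instead of freeing, so the effect is observable */
};

/* Take a reference on behalf of a newly attached thread. */
static void
process_ref(struct process_sketch *pr)
{
	pr->refcnt++;
}

/*
 * Drop a reference. Only the last thread out releases the structure,
 * no matter which thread reaches this point first.
 * Returns 1 if this call released it, 0 otherwise.
 */
static int
process_unref(struct process_sketch *pr)
{
	if (--pr->refcnt > 0)
		return 0;
	pr->freed = 1;
	return 1;
}
```

With two references taken, the first process_unref() call leaves the
structure alone and only the second one releases it, which is the
ordering independence the commit is after.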


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
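
As a rough userland analogue of the atomic bit operations mentioned
above, the sketch below uses C11 <stdatomic.h>. It is not the kernel's
atomic.h interface: setbits() and clearbits() are hypothetical wrappers
that only mirror the semantics of atomic_setbits_int and
atomic_clearbits_int, and the flag values are made up for illustration.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, not the real P_* values. */
#define F_TRACED	0x01u
#define F_WEXIT		0x02u

/* Atomically OR bits into the flag word (read-modify-write in one step). */
static void
setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits from the flag word. */
static void
clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

The point of the atomic form is that a plain "flags |= bit" is a separate
load and store, so an update made from an interrupt between the two can
be lost.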


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
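
A minimal userland sketch of how a parent might observe both transitions
with the POSIX interface. It uses only waitpid(2), WUNTRACED, WCONTINUED,
WIFSTOPPED and WIFCONTINUED; the function name is hypothetical, and the
internal _WSTATUS encoding described above is not exercised directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Stop and resume a child, observing both transitions from the parent.
 * Returns 1 if the stop was reported via WIFSTOPPED and the resume via
 * WIFCONTINUED, 0 if either report was missing, -1 on error.
 */
static int
observe_stop_and_continue(void)
{
	pid_t pid;
	int status, saw_stop, saw_cont;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		raise(SIGSTOP);		/* child stops itself */
		for (;;)
			pause();	/* resumed: stay alive until killed */
	}

	/* WUNTRACED asks for stopped children to be reported. */
	if (waitpid(pid, &status, WUNTRACED) != pid)
		goto fail;
	saw_stop = WIFSTOPPED(status);

	/* WCONTINUED asks for resumed children to be reported. */
	kill(pid, SIGCONT);
	if (waitpid(pid, &status, WCONTINUED) != pid)
		goto fail;
	saw_cont = WIFCONTINUED(status);

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);		/* reap the child */
	return saw_stop && saw_cont;
fail:
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return -1;
}
```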


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
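
The "nice bug" paragraph above can be illustrated with a small sketch.
The names are hypothetical, and the value of SK_PPQ (priorities per run
queue) is an assumption; SK_NICE_WEIGHT follows the "nice weight of two"
in the message and SK_PRIO_MAX is the usual maximum nice value of 20. The
sketch only shows why clipping an unsigned char to UCHAR_MAX does nothing
while the new bound actually limits the value.

```c
#include <limits.h>

#define SK_NICE_WEIGHT	2	/* "nice weight of two", per the message */
#define SK_PRIO_MAX	20	/* maximum nice value */
#define SK_PPQ		4	/* priorities per run queue; assumed value */
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Old behaviour: an unsigned char can never exceed UCHAR_MAX, so a no-op. */
static unsigned char
clip_old(unsigned char estcpu)
{
	return estcpu > UCHAR_MAX ? UCHAR_MAX : estcpu;
}

/* New behaviour: keep estcpu <= NICE_WEIGHT * PRIO_MAX - PPQ. */
static unsigned char
clip_new(unsigned char estcpu)
{
	return estcpu > SK_ESTCPU_LIMIT ? SK_ESTCPU_LIMIT : estcpu;
}
```

Under these assumed constants the new limit works out to 2 * 20 - 4 = 36,
so a heavily penalized process is held below the queue used by nice +20
processes instead of drifting onto it.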


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
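
From userland, the visible effect of SA_NOCLDWAIT can be sketched as
below. This shows only the POSIX-level behaviour (no zombie is left for
the parent, so waiting fails with ECHILD); it says nothing about the
reparenting to PID 1 that the commit uses internally. The function name
is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Set SA_NOCLDWAIT on SIGCHLD, fork a child that exits at once, and try
 * to wait for it. Returns 1 if waitpid() fails with ECHILD (no zombie
 * was left for us), 0 if the child was still waitable, -1 on error.
 */
static int
nocldwait_leaves_no_zombie(void)
{
	struct sigaction sa, old;
	pid_t pid;
	int ok;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return -1;

	pid = fork();
	if (pid == 0)
		_exit(0);		/* child: exit immediately */

	/* waitpid() blocks until the child is gone, then reports ECHILD. */
	ok = (pid != -1 && waitpid(pid, NULL, 0) == -1 && errno == ECHILD);

	sigaction(SIGCHLD, &old, NULL);	/* restore the previous disposition */
	return pid == -1 ? -1 : ok;
}
```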


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.176 01-Jun-2019 mpi

Revert to using the SCHED_LOCK() to protect time accounting.

It currently creates a lock ordering problem because SCHED_LOCK() is taken
by hardclock(). That means the "priorities" of a thread should be moved
out of the SCHED_LOCK() first in order to make progress.

Reported-by: syzbot+8e4863b3dde88eb706dc@syzkaller.appspotmail.com
via anton@ as well as by kettenis@


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
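
The WNOHANG behaviour described above can be checked from userland with
a short sketch. It uses waitpid(2), the POSIX interface, rather than
wait4(2) itself so that it builds under strict POSIX feature macros; the
function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Fork a child that stays alive, poll it once with WNOHANG, then clean
 * up. Returns the result of the poll, or -1 if fork() failed.
 * *status is left exactly as the poll left it.
 */
static pid_t
poll_running_child(int *status)
{
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: wait to be killed */
	}

	/* The child has not changed state, so this returns 0 at once. */
	ret = waitpid(pid, status, WNOHANG);

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);		/* reap the child */
	return ret;
}
```

A caller can preload *status with a sentinel value and confirm that it is
unchanged when the function returns 0.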


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
threads creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
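
The ordering described in 1.121 (clear any reachable pointer first, then
release) can be sketched in userland C. This is only an illustration under
stated assumptions: struct object, struct owner, object_release() and
owner_drop_object() are hypothetical stand-ins, not the kernel's vnode code,
and free() here merely plays the role of a release that may sleep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the kernel's struct vnode or struct process. */
struct object {
	int refcnt;
};

struct owner {
	struct object *obj;	/* pointer that other code can reach */
};

/*
 * Stand-in for a release routine that may sleep when the count hits zero.
 * While it runs, other code could inspect the owner, so the owner must not
 * point at the object any more by the time this is called.
 */
static void
object_release(struct object *o)
{
	assert(o->refcnt > 0);
	if (--o->refcnt == 0)
		free(o);
}

/* Clear the reachable pointer first, then release through a local copy. */
static void
owner_drop_object(struct owner *ow)
{
	struct object *o = ow->obj;

	ow->obj = NULL;
	if (o != NULL)
		object_release(o);
}
```

Calling owner_drop_object() twice on the same owner is harmless in this
sketch, because the second call finds the pointer already cleared.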


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat members from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
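
As a rough illustration of the reference-counting idea in 1.74, the sketch
below frees shared state only when the last holder lets go, regardless of
which holder finishes first. All names (struct shared_state, shared_create(),
shared_hold(), shared_release()) are hypothetical, and the plain int counter
assumes the caller already serializes access with some lock; it is not the
kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for state shared by all threads of a process. */
struct shared_state {
	int refcnt;	/* one reference per thread still using the state */
};

/* Create the state with one reference, owned by the first thread. */
static struct shared_state *
shared_create(void)
{
	struct shared_state *s = malloc(sizeof(*s));

	if (s != NULL)
		s->refcnt = 1;
	return s;
}

/* Take an extra reference, e.g. when another thread is added. */
static void
shared_hold(struct shared_state *s)
{
	assert(s->refcnt > 0);
	s->refcnt++;
}

/*
 * Drop one reference. Only the call that drops the last reference frees
 * the state; it returns true in that case and false otherwise.
 */
static bool
shared_release(struct shared_state *s)
{
	assert(s->refcnt > 0);
	if (--s->refcnt > 0)
		return false;
	free(s);
	return true;
}
```

With two holders, the first shared_release() returns false and leaves the
state intact for the remaining holder; only the second call frees it.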


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
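
A userland analogue of the 1.63 change, using C11 <stdatomic.h> rather than
the kernel's atomic.h: flag bits are set and cleared with atomic
read-modify-write operations instead of plain |= and &=, so concurrent
updates to different bits cannot be lost. The helper names and flag values
below are assumptions made for this sketch only.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, chosen only for this example. */
#define XF_TRACED	0x01u
#define XF_PROFIL	0x02u

/* Atomically OR the given bits into the flag word. */
static void
flag_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear the given bits in the flag word. */
static void
flag_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}

/* Small self-check showing the intended use. */
static void
flags_demo(void)
{
	atomic_uint flags;

	atomic_init(&flags, 0u);
	flag_setbits(&flags, XF_TRACED | XF_PROFIL);
	flag_clearbits(&flags, XF_TRACED);
	assert(atomic_load(&flags) == XF_PROFIL);
}
```

A plain `flags |= bit` compiles to a separate load and store, so an update
made in between by an interrupt handler or another CPU could be overwritten;
the atomic forms avoid that window.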


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
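
The encoding trick in 1.47 can be checked with a small sketch. The X_ macros
below are assumptions modeled on the description above (low seven bits as
the status field, a continued value whose low seven bits equal the stopped
marker but whose low byte does not); they are deliberately not the real
<sys/wait.h> definitions, and classify() is a hypothetical helper.

```c
#include <assert.h>

/* Assumed encodings for illustration; prefixed to avoid <sys/wait.h>. */
#define X_WSTATUS(x)		((x) & 0177)
#define X_WSTOPPED		0177
#define X_WCONTINUED		0177777
#define X_WIFSTOPPED(x)		(((x) & 0xff) == X_WSTOPPED)
#define X_WIFCONTINUED(x)	(((x) & X_WCONTINUED) == X_WCONTINUED)
#define X_WIFSIGNALED(x) \
	(X_WSTATUS(x) != X_WSTOPPED && X_WSTATUS(x) != 0)
#define X_WIFEXITED(x)		(X_WSTATUS(x) == 0)

/* The property the commit message relies on, checked at compile time. */
static_assert(X_WSTATUS(X_WCONTINUED) == X_WSTOPPED,
    "continued status must look like 'stopped' in the low seven bits");

enum wait_kind { K_CONTINUED, K_STOPPED, K_SIGNALED, K_EXITED };

/* Classify a status word using only the macros above. */
static enum wait_kind
classify(int status)
{
	if (X_WIFCONTINUED(status))
		return K_CONTINUED;
	if (X_WIFSTOPPED(status))
		return K_STOPPED;
	if (X_WIFSIGNALED(status))
		return K_SIGNALED;
	return K_EXITED;
}
```

Because the continued value keeps the stopped marker in its low seven bits,
X_WIFSIGNALED() and X_WIFEXITED() stay false for it without any extra
check, while the full-byte comparison in X_WIFSTOPPED() rejects it.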


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.175 31-May-2019 mpi

Use a per-process mutex to protect time accounting instead of SCHED_LOCK().

Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
lock.

ok visa@, cheloha@


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
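
The WNOHANG behaviour described in 1.143 can be observed from userland. The
following is only a sketch, not code from the tree: the helper name
demo_poll_running_child() and the DEMO_SENTINEL value are made up for
illustration, and it assumes a system that provides wait4(2).

```c
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical marker value used to detect writes to the status word. */
#define DEMO_SENTINEL 0x5a5a

/*
 * Fork a child that stays alive, poll it once with WNOHANG, then kill
 * and reap it.  Returns what wait4() returned for the poll (-1 if the
 * fork failed); *statusp holds whatever the poll left in the status word.
 */
static pid_t
demo_poll_running_child(int *statusp)
{
	pid_t pid, ret;

	*statusp = DEMO_SENTINEL;
	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: block until killed */
	}

	/* The child has not terminated, so this should return 0. */
	ret = wait4(pid, statusp, WNOHANG, NULL);

	/* Clean up: terminate the child and collect it. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);
	return ret;
}
```

With the fix in place, the poll is expected to return 0 and leave the
caller's status word at the sentinel value.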


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
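
The idea in 1.125 of a small cache of recently exited pids can be sketched as
a ring buffer that the allocator consults before handing out a pid. This is an
illustration under assumptions, not the kernel code: every demo_* name and the
cache size are hypothetical, pids are assumed to be positive, and the candidate
search is linear here only to keep the example deterministic.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_RECENT 8	/* hypothetical cache size */

/* Ring of recently exited pids; 0 marks an unused slot. */
static int demo_recent[DEMO_RECENT];
static size_t demo_next;

/* Record an exited pid, overwriting the oldest entry. */
static void
demo_pid_exited(int pid)
{
	demo_recent[demo_next] = pid;
	demo_next = (demo_next + 1) % DEMO_RECENT;
}

/* Is this pid still in the "do not recycle yet" cache? */
static bool
demo_pid_is_recent(int pid)
{
	for (size_t i = 0; i < DEMO_RECENT; i++)
		if (demo_recent[i] == pid)
			return true;
	return false;
}

/* Return the first candidate at or above start that is not cached. */
static int
demo_pid_alloc(int start)
{
	int pid = start;

	while (demo_pid_is_recent(pid))
		pid++;
	return pid;
}
```

Once DEMO_RECENT further exits have been recorded, the oldest pids fall out of
the ring and become eligible for reuse again.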


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
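
The ordering described in 1.121 (unpublish the pointer, then release) can be
sketched as follows. This is a userland model, not the kernel code: the
demo_* types, the ps_textvp field name and the demo_vrele() stand-in are
assumptions made for the example.

```c
#include <stddef.h>

struct demo_vnode {
	int v_usecount;
};

struct demo_process {
	struct demo_vnode *ps_textvp;	/* hypothetical field name */
};

/* Set if the release routine ever ran while the pointer was still visible. */
static int demo_seen_dangling;
static struct demo_process *demo_observer;

/*
 * Stand-in for vrele(): in the kernel this may sleep, so anything still
 * pointing at vp could be followed by someone else in the meantime.
 */
static void
demo_vrele(struct demo_vnode *vp)
{
	if (demo_observer != NULL && demo_observer->ps_textvp == vp)
		demo_seen_dangling = 1;
	vp->v_usecount--;
}

/* Clear the reachable pointer first, then drop the reference. */
static void
demo_release_text(struct demo_process *pr)
{
	struct demo_vnode *vp = pr->ps_textvp;

	pr->ps_textvp = NULL;
	if (vp != NULL)
		demo_vrele(vp);
}
```

Because the field is cleared before the release, a second call is a no-op and
the release routine never observes a still-published pointer.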


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running on that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
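
The reference count described in 1.74 can be modelled in a few lines. This is
a sketch under assumptions rather than the committed code: the demo_* names
and the ps_refcnt field name are illustrative, and callers are assumed to be
serialized by some outer lock, so the counter is a plain int.

```c
#include <stdlib.h>

struct demo_process {
	int ps_refcnt;	/* threads that still reference this process */
};

static int demo_frees;	/* how many processes have been freed so far */

/* Allocate a process holding one reference for its main thread. */
static struct demo_process *
demo_process_new(void)
{
	struct demo_process *pr = calloc(1, sizeof(*pr));

	if (pr != NULL)
		pr->ps_refcnt = 1;
	return pr;
}

/* A newly created thread takes its own reference. */
static void
demo_process_hold(struct demo_process *pr)
{
	pr->ps_refcnt++;
}

/*
 * Drop one reference.  Only the last thread through frees the
 * structure; returns 1 if it was freed, 0 otherwise.
 */
static int
demo_process_rele(struct demo_process *pr)
{
	if (--pr->ps_refcnt > 0)
		return 0;
	free(pr);
	demo_frees++;
	return 1;
}
```

Whichever thread is reaped first, main or not, the structure stays valid until
the last reference is dropped.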


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
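
The TAILQ point in 1.72 can be illustrated with a toy sleep queue built on
<sys/queue.h>: a wakeup walks the queue once and unlinks each match in
constant time. This is a sketch only; the demo_* names and fields are
hypothetical and none of the real scheduler locking is modelled.

```c
#include <sys/queue.h>
#include <stddef.h>

struct demo_proc {
	const void *p_wchan;		/* channel the thread sleeps on */
	int p_awake;			/* set once the thread is woken */
	TAILQ_ENTRY(demo_proc) p_runq;	/* queue linkage */
};

TAILQ_HEAD(demo_slpque, demo_proc);

/* Put p to sleep on chan by appending it to the queue. */
static void
demo_sleep(struct demo_slpque *q, struct demo_proc *p, const void *chan)
{
	p->p_wchan = chan;
	p->p_awake = 0;
	TAILQ_INSERT_TAIL(q, p, p_runq);
}

/* Wake every sleeper on chan in one pass; returns how many were woken. */
static int
demo_wakeup(struct demo_slpque *q, const void *chan)
{
	struct demo_proc *p, *next;
	int n = 0;

	for (p = TAILQ_FIRST(q); p != NULL; p = next) {
		next = TAILQ_NEXT(p, p_runq);	/* fetch before unlinking */
		if (p->p_wchan != chan)
			continue;
		TAILQ_REMOVE(q, p, p_runq);	/* constant-time unlink */
		p->p_wchan = NULL;
		p->p_awake = 1;
		n++;
	}
	return n;
}
```

Each element is visited once and removal needs no second search of the queue,
which is what keeps the whole wakeup linear in the queue length.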


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
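
The mechanical change in 1.63 replaces plain read-modify-write updates of the
flag word with atomic ones. A rough userland analogue using C11 <stdatomic.h>
is sketched below; the kernel's atomic_{set,clear}bits_int are its own
primitives, and the demo_* names and flag values here are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical flag bits; the real P_* values live in sys/proc.h. */
#define DEMO_P_FLAG_A 0x0001u
#define DEMO_P_FLAG_B 0x0002u

struct demo_proc {
	atomic_uint p_flag;	/* stand-in for the kernel's p_flag word */
};

/* Atomically OR bits into the flag word. */
static void
demo_setbits(struct demo_proc *p, unsigned int bits)
{
	atomic_fetch_or(&p->p_flag, bits);
}

/* Atomically clear bits in the flag word. */
static void
demo_clearbits(struct demo_proc *p, unsigned int bits)
{
	atomic_fetch_and(&p->p_flag, ~bits);
}

/* Read the current flag word. */
static unsigned int
demo_flags(struct demo_proc *p)
{
	return atomic_load(&p->p_flag);
}
```

A plain `p_flag |= bits` is a separate load and store, so an interrupt that
updates the word in between can have its change overwritten; the atomic forms
close that window.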


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
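
The ".75 nice" figure in the "Algorithm change" paragraph of revision 1.20 can be reproduced with a short calculation. This is a rough sketch under stated assumptions, not the scheduler code: it assumes the classic 4.4BSD decay factor of 2*load / (2*load + 1) applied once per second, assumes the nice value was added to p_estcpu after each decay, and takes the p_estcpu/4 weighting from the formula quoted in the message. The function names are hypothetical.

```python
def extra_estcpu(nice, load, seconds):
    """Accumulated p_estcpu contribution from adding `nice` once per second.

    Assumes the 4.4BSD-style decay factor 2*load / (2*load + 1).
    """
    decay = (2.0 * load) / (2.0 * load + 1.0)
    est = 0.0
    for _ in range(seconds):
        # Decay the history, then add the nice value (the removed behavior).
        est = decay * est + nice
    return est


def extra_priority(nice, load, seconds):
    """Priority contribution of that term, using pri = p_estcpu/4 + nice*2."""
    return extra_estcpu(nice, load, seconds) / 4.0
```

With a load average of one the decay factor is 2/3, so the fixed point of est = (2/3)*est + nice is 3*nice, and dividing by 4 gives 0.75*nice. After a single second the term is only nice/4, which is 1/8 of the direct nice*2 weight, matching the "1/8" and "at least 3x that" remarks in the message.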


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.174 13-May-2019 bluhm

When killing a process, the signal is handled by any thread that
does not block the signal. If all threads block the signal, we
delivered it to the main thread. This does not conform to POSIX.
If any thread unblocks the signal, it should be delivered immediately
to this thread.
Mark such signals pending at the process instead of a single thread.
Then any thread can handle it later.
OK kettenis@ guenther@


Revision tags: OPENBSD_6_5_BASE
# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
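
The p_estcpu clipping described under "nice bug" can be sketched as below.
This is a rough illustration, not the committed code: the helper name
clip_estcpu() is made up, and the constant values are assumptions. The
message itself only states the nice weight of two; PRIO_MAX = 20 and
PPQ = 4 are assumed here.

```c
/*
 * Assumed constants for illustration. Only the nice weight of two is
 * stated in the commit message; the other two values are assumptions.
 */
#define NICE_WEIGHT	2	/* priority points per nice step */
#define PRIO_MAX	20	/* largest nice value (assumed) */
#define PPQ		4	/* priorities per run queue (assumed) */

/* Upper bound from the message: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX - PPQ)

/*
 * Hypothetical helper: clip the decayed CPU estimate so that a
 * compute-bound process cannot penalize itself onto the same queue
 * as a nice +20 process.
 */
unsigned int
clip_estcpu(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT;
}
```

With these assumed values the limit is 36, so clipping to UCHAR_MAX (255),
as the old code did, never constrained the value in any useful way.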


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.173 23-Jan-2019 tedu

eliminate a ?: in witness mtx initializer by pushing the default one
level up.
ok guenther mpi visa


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
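
The WNOHANG behaviour described in 1.143 can be checked from userland with
a sketch like the one below. It is an illustration, not a regress test from
the tree: the helper name and the sentinel value are made up.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

#define SENTINEL	0x5a5a	/* arbitrary marker value for this sketch */

/*
 * Hypothetical helper: poll a still-running child with WNOHANG and
 * report whether the status word was left untouched.
 * Returns 1 if untouched, 0 if overwritten, -1 on error.
 */
int
wnohang_leaves_status_alone(void)
{
	int status = SENTINEL;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child idles until it is killed */
	}

	/* The child has not exited, so wait4() should return 0 at once. */
	ret = wait4(pid, &status, WNOHANG, NULL);

	/* Clean up so no child is left behind. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);

	if (ret != 0)
		return -1;
	return status == SENTINEL;
}
```

On a kernel with this fix the helper is expected to return 1, since a
return of 0 from wait4() must leave the caller's status word alone.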


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it has been unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@
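
The reference-counting idea in this commit can be sketched in a few lines. This is an illustrative model only: the struct, its fields, and the function names are hypothetical, a boolean stands in for actually returning the structure to its pool, and the counter is not atomic because the sketch assumes a single lock already serializes callers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct demo_process {
	int	ps_refcnt;	/* one reference per thread still attached */
	bool	freed;		/* models freeing the structure */
};

void
demo_process_init(struct demo_process *pr, int nthreads)
{
	pr->ps_refcnt = nthreads;
	pr->freed = false;
}

/*
 * Called when the reaper is done with one thread.  Only the last
 * thread to be reaped releases the shared structure, regardless of
 * whether that thread is the main thread.
 */
bool
demo_thread_reaped(struct demo_process *pr)
{
	assert(pr->ps_refcnt > 0);
	if (--pr->ps_refcnt == 0) {
		pr->freed = true;
		return true;
	}
	return false;
}

/*
 * Models the problem case: the main thread reaches the reaper before
 * a second thread.  Returns 1 if the structure survived the first
 * reap and was released by the second.
 */
int
demo_main_thread_reaped_first(void)
{
	struct demo_process pr;
	bool survived_first, freed_by_last;

	demo_process_init(&pr, 2);
	demo_thread_reaped(&pr);		/* main thread goes first */
	survived_first = !pr.freed;
	freed_by_last = demo_thread_reaped(&pr);	/* other thread */
	return survived_first && freed_by_last && pr.freed;
}
```

Without the count, the first reap would free the structure while the second thread still points at it, which is the use-after-free the commit message describes.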


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanical
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
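
As a rough userland analogue of the change above, the sketch below updates a flag word with atomic read-modify-write operations instead of plain |= and &=. It uses C11 <stdatomic.h> as a stand-in for the kernel's atomic_{set,clear}bits_int; the demo_ names and the flag values are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_P_TRACED	0x0001u
#define DEMO_P_PROFIL	0x0002u

static_assert((DEMO_P_TRACED & DEMO_P_PROFIL) == 0,
    "flag bits must not overlap");

struct demo_proc {
	atomic_uint p_flag;
};

/* Atomically OR bits into the word; safe against concurrent updaters. */
void
demo_setbits(atomic_uint *word, unsigned int bits)
{
	atomic_fetch_or(word, bits);
}

/* Atomically clear bits in the word. */
void
demo_clearbits(atomic_uint *word, unsigned int bits)
{
	atomic_fetch_and(word, ~bits);
}

/* Set two flags, clear one, and return what is left. */
unsigned int
demo_flag_roundtrip(void)
{
	struct demo_proc p;

	atomic_init(&p.p_flag, 0);
	demo_setbits(&p.p_flag, DEMO_P_TRACED | DEMO_P_PROFIL);
	demo_clearbits(&p.p_flag, DEMO_P_TRACED);
	return atomic_load(&p.p_flag);
}
```

The point of the atomic form is that an interrupt handler and ordinary code can both touch the word without one update silently overwriting the other.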


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
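
The encoding trick described above can be checked with a small model. The macros below are hypothetical DEMO_ copies, and their octal values are assumptions modelled on the usual BSD <sys/wait.h> layout rather than quoted from this revision.

```c
#include <assert.h>

/* Assumed layout: low 7 bits carry the status, 0177 marks "stopped". */
#define DEMO_WSTATUS(x)		((x) & 0177)
#define DEMO_WSTOPPED		0177
#define DEMO_WCONTINUED		0177777

#define DEMO_WIFSTOPPED(x)	(((x) & 0xff) == DEMO_WSTOPPED)
#define DEMO_WIFSIGNALED(x) \
	(DEMO_WSTATUS(x) != DEMO_WSTOPPED && DEMO_WSTATUS(x) != 0)
#define DEMO_WIFCONTINUED(x)	(((x) & DEMO_WCONTINUED) == DEMO_WCONTINUED)

/* The two properties the commit message relies on. */
static_assert(DEMO_WSTATUS(DEMO_WCONTINUED) == DEMO_WSTOPPED,
    "continued status looks like stopped to WSTATUS");
static_assert(!DEMO_WIFSTOPPED(DEMO_WCONTINUED),
    "but the widened WIFSTOPPED mask rejects it");

/* Classify a status word: 'c'ontinued, s't'opped, 's'ignaled, 'e'xited. */
char
demo_classify(int status)
{
	if (DEMO_WIFCONTINUED(status))
		return 'c';
	if (DEMO_WIFSTOPPED(status))
		return 't';
	if (DEMO_WIFSIGNALED(status))
		return 's';
	return 'e';
}
```

Because the continued value has the low seven bits set to the stopped marker, the signaled test rejects it without modification, and the eight-bit mask in the stopped test rejects it as well, so only the continued test matches.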


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
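
The clipping rule from the "nice bug" paragraph can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the scheduler's code: the nice weight of two comes from the message, while the PPQ value of 4, the demo_ names, and the omission of the base user priority and nice offset are simplifications made here.

```c
#include <assert.h>

#define DEMO_NICE_WEIGHT	2	/* "nice weight of two" from the text */
#define DEMO_PRIO_MAX		20	/* largest nice value */
#define DEMO_PPQ		4	/* assumed priorities per run queue */
#define DEMO_ESTCPU_LIMIT	(DEMO_NICE_WEIGHT * DEMO_PRIO_MAX - DEMO_PPQ)

/* Clip the decayed CPU estimate so it cannot exceed the limit. */
int
demo_estcpulim(int estcpu)
{
	return estcpu < DEMO_ESTCPU_LIMIT ? estcpu : DEMO_ESTCPU_LIMIT;
}

/*
 * Toy queue index within the user priority range: clipped CPU
 * estimate plus weighted nice value, divided into queues of
 * DEMO_PPQ priorities each.  Higher index means worse priority.
 */
int
demo_user_queue(int estcpu, int nice)
{
	assert(nice >= 0 && nice <= DEMO_PRIO_MAX);
	return (demo_estcpulim(estcpu) + DEMO_NICE_WEIGHT * nice) / DEMO_PPQ;
}
```

With these numbers the limit is 36, so a compute-bound process at nice 0 tops out one queue below an idle process at nice +20, which is the separation the message says the old UCHAR_MAX clip failed to provide.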


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.172 06-Jan-2019 visa

Fix unsafe use of ptsignal() in mi_switch().

ptsignal() has to be called with the kernel lock held. As ensuring the
locking in mi_switch() is not easy, and deferring the signaling using
the task API is not possible because of lock order issues in
mi_switch(), move the CPU time checking into a periodic timer where
the kernel can be locked without issues.

With this change, each process has a dedicated resource check timer.
The timer gets activated only when a CPU time limit is set. Because the
checking is not done as frequently as before, some precision is lost.

Use of timers adapted from FreeBSD.

OK tedu@

Reported-by: syzbot+2f5d62256e3280634623@syzkaller.appspotmail.com


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
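
The WNOHANG case above can be exercised from userland roughly as follows.
This is a sketch under assumptions: poll_running_child() and STATUS_SENTINEL
are hypothetical names, and the child simply blocks in pause(2) so that it
is guaranteed to be running when the parent polls it.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a5a5a	/* hypothetical marker value */

/*
 * Hypothetical helper: fork a child that never exits on its own, poll it
 * once with WNOHANG, and report what waitpid() returned.  *status_out
 * receives the status word as the call left it, so a caller can compare
 * it against STATUS_SENTINEL.
 */
pid_t
poll_running_child(int *status_out)
{
	int status = STATUS_SENTINEL;
	pid_t pid, rc, reaped;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: stay alive until killed */
	}

	/* The child has not terminated, so there is no status to collect. */
	rc = waitpid(pid, &status, WNOHANG);
	*status_out = status;

	/* Clean up: kill the child and reap it for real. */
	kill(pid, SIGKILL);
	do {
		reaped = waitpid(pid, NULL, 0);
	} while (reaped == -1 && errno == EINTR);
	assert(reaped == pid);
	(void)reaped;

	return rc;
}
```

With the behavior this commit describes, the poll returns 0 and the caller's
status word is expected to still hold the sentinel.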


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
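
The ordering described above (unhook the pointer first, release second) can
be sketched in plain C. Everything here is a hypothetical stand-in: the
struct names, release_may_sleep() and the observer variable are not the
kernel's definitions, only a model of "someone else can run while the
release routine sleeps".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types. */
struct vnode_sketch {
	int refcnt;
};
struct proc_sketch {
	struct vnode_sketch *textvp;
};

/* Models any other code that could look at the proc while we sleep. */
static struct proc_sketch *observer;

/* Models a release routine that may sleep when the count hits zero. */
void
release_may_sleep(struct vnode_sketch *vp)
{
	/* While we "sleep", nobody may reach vp through the proc. */
	assert(observer == NULL || observer->textvp == NULL);
	if (--vp->refcnt == 0)
		free(vp);
}

/* Erase the reachable pointer first, then drop the reference. */
void
drop_textvp(struct proc_sketch *p)
{
	struct vnode_sketch *vp = p->textvp;

	p->textvp = NULL;
	if (vp != NULL)
		release_may_sleep(vp);
}
```

Doing the two steps in the opposite order would leave a window in which the
proc still points at a vnode that is being torn down.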


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
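
The idea of the reference count can be sketched as below. This is only an
illustration with hypothetical names (struct process_sketch, the
process_sketch_* helpers, the processes_freed counter); it is single-threaded
and does not model whatever locking the kernel relies on.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct process_sketch {
	int refs;	/* one reference per thread still using it */
};

static int processes_freed;	/* lets callers observe when free happens */

/* Create the structure holding the first thread's reference. */
struct process_sketch *
process_sketch_new(void)
{
	struct process_sketch *pr = calloc(1, sizeof(*pr));

	if (pr != NULL)
		pr->refs = 1;
	return pr;
}

/* Taken when another thread joins the process. */
void
process_sketch_ref(struct process_sketch *pr)
{
	assert(pr->refs > 0);
	pr->refs++;
}

/* Dropped when a thread is reaped; only the last one frees. */
void
process_sketch_rele(struct process_sketch *pr)
{
	assert(pr->refs > 0);
	if (--pr->refs == 0) {
		free(pr);
		processes_freed++;
	}
}
```

If the main thread is reaped first, its release only lowers the count; the
structure stays valid until the remaining thread drops its reference too.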


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
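
A userland analogue of the set/clear operations named above, written with
C11 atomics. This is a sketch, not the kernel's implementation: the
sketch_* functions and PF_FLAG_* values are hypothetical, and the real
atomic_{set,clear}bits_int are provided per architecture.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits, for illustration only. */
#define PF_FLAG_A	0x01u
#define PF_FLAG_B	0x02u

/* Atomically OR bits into *p, so a concurrent update is not lost. */
static inline void
sketch_setbits_int(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Atomically clear bits in *p. */
static inline void
sketch_clearbits_int(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}

/* Small self-check showing the intended use. */
void
sketch_flags_demo(void)
{
	atomic_uint flags = 0;

	sketch_setbits_int(&flags, PF_FLAG_A | PF_FLAG_B);
	sketch_clearbits_int(&flags, PF_FLAG_A);
	assert(atomic_load(&flags) == PF_FLAG_B);
}
```

A plain "flags |= bit" is a read-modify-write that an interrupt can split,
which is the hazard the commit is addressing.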


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they have. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
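
From userland, the new flag can be exercised roughly as follows. This is a minimal sketch assuming a POSIX.1-2001 system; observe_continue() is a hypothetical helper name, and error handling is reduced to a single return code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stop and resume a child, then report whether waitpid() saw the resume.
 * Returns 1 if WIFCONTINUED() was true, 0 if not, -1 on error.
 */
static int
observe_continue(void)
{
	int status, result = -1;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: idle until killed */
	}

	if (kill(pid, SIGSTOP) == 0 &&
	    waitpid(pid, &status, WUNTRACED) == pid && WIFSTOPPED(status) &&
	    kill(pid, SIGCONT) == 0 &&
	    waitpid(pid, &status, WCONTINUED) == pid)
		result = WIFCONTINUED(status) ? 1 : 0;

	kill(pid, SIGKILL);		/* clean up the child */
	waitpid(pid, &status, 0);
	return result;
}
```

Without WCONTINUED in the options, the second waitpid() would not report the resume and would block until the child exits.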


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
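
The clipping rule from the "nice bug" section can be illustrated with a small sketch. It assumes that, after this change, the user priority is a base value plus the unscaled p_estcpu plus NICE_WEIGHT times the nice value; the message does not spell out the new formula, and the PUSER and PPQ values below are assumptions, as are all function names.

```c
#define PUSER       50	/* assumed base user priority */
#define PRIO_MAX    20	/* highest nice value */
#define NICE_WEIGHT 2	/* priorities per nice level, per the message */
#define PPQ         4	/* assumed number of priorities per run queue */

/* Upper bound on p_estcpu named in the message. */
#define ESTCPU_LIMIT (NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the CPU usage estimate so it cannot outweigh nice +20. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu < ESTCPU_LIMIT ? estcpu : ESTCPU_LIMIT;
}

/* Assumed priority formula; nice is taken to be in 0..PRIO_MAX here. */
static unsigned int
user_priority(unsigned int estcpu, int nice)
{
	return PUSER + estcpu_clip(estcpu) + NICE_WEIGHT * nice;
}

/* Run queue index for a priority (larger means less favoured). */
static unsigned int
run_queue(unsigned int pri)
{
	return pri / PPQ;
}
```

Under these assumptions a compute-bound nice +0 process tops out one queue short of an idle nice +20 process, which is the separation the message asks for.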


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
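
The user-visible effect of SA_NOCLDWAIT can be sketched as below. The reparenting to PID 1 happens inside the kernel and is not shown; nocldwait_demo() is a hypothetical helper, and the sketch assumes a POSIX system where wait() fails with ECHILD once no waitable children remain.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set SA_NOCLDWAIT on SIGCHLD, fork a child that exits at once, then wait.
 * Returns 1 if wait() fails with ECHILD (no zombie was left for us),
 * 0 if a child status was returned, -1 on setup failure.
 */
static int
nocldwait_demo(void)
{
	struct sigaction sa, old;
	int status, result;
	pid_t pid;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return -1;

	pid = fork();
	if (pid == -1) {
		sigaction(SIGCHLD, &old, NULL);
		return -1;
	}
	if (pid == 0)
		_exit(0);

	/* wait() blocks until the child is gone, then has nothing to report. */
	result = (wait(&status) == -1 && errno == ECHILD) ? 1 : 0;

	sigaction(SIGCHLD, &old, NULL);	/* restore the previous disposition */
	return result;
}
```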


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.171 12-Nov-2018 visa

Add a mechanism for managing asynchronous IO signal registrations.
It centralizes IO signal privilege checking and makes it possible to revoke
a registration when the target process or process group is deleted.

Adapted from FreeBSD.

OK kettenis@ mpi@ guenther@


Revision tags: OPENBSD_6_4_BASE
# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
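
The behaviour fixed here can be checked from userland with a sketch like the following. wnohang_leaves_status() is a hypothetical helper and the sentinel value is arbitrary; the sketch assumes the child is still running when the non-blocking wait is made.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Poll a running child with WNOHANG and check that the status word is
 * left alone. Returns 1 if waitpid() returned 0 and the sentinel
 * survived, 0 otherwise, -1 if fork() failed.
 */
static int
wnohang_leaves_status(void)
{
	int status = 0x5a5a;		/* arbitrary sentinel */
	int result, scratch;
	pid_t pid, rv;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: stay alive until killed */
	}

	rv = waitpid(pid, &status, WNOHANG);
	result = (rv == 0 && status == 0x5a5a) ? 1 : 0;

	kill(pid, SIGKILL);		/* clean up the child */
	waitpid(pid, &scratch, 0);
	return result;
}
```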


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
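
One way to picture a "small cache of recently exited pids" is a fixed-size ring that the allocator consults before reusing a pid. The sketch below is an assumption about the general shape, not the committed code; RECENT_PIDS, pid_remember() and pid_is_recent() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define RECENT_PIDS 8	/* hypothetical cache size */

/* Zero-initialized; pid 0 is never handed out, so that is harmless. */
static int recent_pids[RECENT_PIDS];
static size_t recent_next;

/* Remember a pid that has just exited, overwriting the oldest entry. */
static void
pid_remember(int pid)
{
	recent_pids[recent_next] = pid;
	recent_next = (recent_next + 1) % RECENT_PIDS;
}

/* True if the pid exited recently and should not be reused yet. */
static bool
pid_is_recent(int pid)
{
	for (size_t i = 0; i < RECENT_PIDS; i++)
		if (recent_pids[i] == pid)
			return true;
	return false;
}
```

An allocator built on this would pick a candidate pid and retry while pid_is_recent() returns true.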


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
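
The ordering described here (unhook the pointer, then release) can be sketched with stand-in types. struct vnode_stub, struct holder and stub_release() are hypothetical; stub_release() merely plays the role of vrele(), which in the kernel may sleep.

```c
#include <stddef.h>

/* Stand-ins for the real structures, for illustration only. */
struct vnode_stub { int refcnt; };
struct holder { struct vnode_stub *textvp; };

static int release_calls;	/* counts releases, for the example only */

/* Plays the role of vrele(): the real function may sleep. */
static void
stub_release(struct vnode_stub *vp)
{
	vp->refcnt--;
	release_calls++;
}

/*
 * Drop the holder's reference. The pointer is cleared before the
 * release, so anyone who looks at the holder while the release sleeps
 * sees NULL rather than a vnode that is on its way out.
 */
static void
holder_drop_text(struct holder *h)
{
	struct vnode_stub *vp = h->textvp;

	h->textvp = NULL;
	if (vp != NULL)
		stub_release(vp);
}
```

Calling holder_drop_text() a second time is a no-op, since the pointer is already gone.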


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.
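
The GC-list scheme described above can be sketched in userland C. This is an illustration only, not the librthread code: struct dead_thread, gc_push() and gc_scan() are hypothetical names, and the "kernel zeroes the tid" step is simulated by a plain store. Real code would also need synchronization around the list.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical bookkeeping record for one exiting thread. */
struct dead_thread {
	volatile pid_t tid;		/* zeroed once the thread is fully gone */
	void *stack;			/* stack to release once tid == 0 */
	struct dead_thread *next;
};

/* Put an exiting thread's record on the GC list. */
static struct dead_thread *
gc_push(struct dead_thread **head, pid_t tid, size_t stacksize)
{
	struct dead_thread *t = malloc(sizeof(*t));

	assert(t != NULL);
	t->tid = tid;
	t->stack = malloc(stacksize);
	assert(t->stack != NULL);
	t->next = *head;
	*head = t;
	return t;
}

/*
 * Scan the GC list and free every entry whose tid has been zeroed.
 * Entries with a nonzero tid stay on the list for a later scan.
 * Returns the number of entries reclaimed.
 */
static int
gc_scan(struct dead_thread **head)
{
	struct dead_thread **pp = head;
	int reclaimed = 0;

	while (*pp != NULL) {
		struct dead_thread *t = *pp;

		if (t->tid == 0) {
			*pp = t->next;	/* unlink before freeing */
			free(t->stack);
			free(t);
			reclaimed++;
		} else {
			pp = &t->next;
		}
	}
	return reclaimed;
}
```

In this sketch a scan would run from the equivalents of pthread_create() and pthread_join(), so no dedicated reaper thread or kqueue is needed.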

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
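
A minimal sketch of the reference-counting idea, assuming one reference per thread and a big lock serializing all callers (so plain int arithmetic is enough). The names struct process_sketch, process_hold() and process_rele() are hypothetical and are not the kernel's interfaces.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared per-process structure. */
struct process_sketch {
	int refcnt;		/* threads still pointing at this struct */
};

/* Create the structure with one reference, held by the main thread. */
static struct process_sketch *
process_new(void)
{
	struct process_sketch *pr = calloc(1, sizeof(*pr));

	assert(pr != NULL);
	pr->refcnt = 1;
	return pr;
}

/* Take a reference on behalf of an additional thread. */
static void
process_hold(struct process_sketch *pr)
{
	pr->refcnt++;
}

/*
 * Drop one reference. Only the last thread through frees the
 * structure, no matter in which order the threads reach the reaper.
 * Returns 1 if the structure was freed, 0 otherwise.
 */
static int
process_rele(struct process_sketch *pr)
{
	assert(pr->refcnt > 0);
	if (--pr->refcnt == 0) {
		free(pr);
		return 1;
	}
	return 0;
}
```

With this shape, the main thread reaching the reaper first merely drops its reference; the memory stays valid until the remaining thread releases its own.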


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
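
The split described above (MI code picks the next process from a TAILQ, MD code only switches to it) might look roughly like this. It is a userland sketch using the <sys/queue.h> macros; struct proc_sketch, runq_enqueue() and runq_choose() are hypothetical names, and the real scheduler keeps several queues and takes locks that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, stripped-down process entry. */
struct proc_sketch {
	int pid;
	TAILQ_ENTRY(proc_sketch) p_runq;	/* run queue linkage */
};

TAILQ_HEAD(runq, proc_sketch);

/* Make a process runnable by appending it to the run queue. */
static void
runq_enqueue(struct runq *q, struct proc_sketch *p)
{
	assert(p != NULL);
	TAILQ_INSERT_TAIL(q, p, p_runq);
}

/*
 * MI side: pick the next process to run and take it off the queue.
 * The result would then be handed to the MD switch routine, which
 * does nothing but the context switch itself.
 */
static struct proc_sketch *
runq_choose(struct runq *q)
{
	struct proc_sketch *p = TAILQ_FIRST(q);

	if (p != NULL)
		TAILQ_REMOVE(q, p, p_runq);
	return p;
}
```

Because TAILQ removal is O(1) given the element, dequeuing a chosen process does not require walking the list.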


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
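
To illustrate why a plain `p_flag |= bit` is unsafe when interrupts touch the same word, here is a sketch using C11 <stdatomic.h> as a userland stand-in for the kernel's atomic_setbits_int() and atomic_clearbits_int(). The wrappers flag_setbits() and flag_clearbits() are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Set bits in a flag word as one indivisible read-modify-write, so a
 * concurrent update from another context cannot be lost.
 */
static void
flag_setbits(atomic_uint *flag, unsigned int bits)
{
	atomic_fetch_or(flag, bits);
}

/* Clear bits in a flag word, again as one indivisible operation. */
static void
flag_clearbits(atomic_uint *flag, unsigned int bits)
{
	atomic_fetch_and(flag, ~bits);
}
```

A non-atomic `flag |= bits` compiles to a separate load, OR and store; an interrupt that modifies the word between the load and the store would have its change overwritten.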


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
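
From userland, the new flag can be exercised as in the sketch below. It uses only the public macros (WUNTRACED, WCONTINUED, WIFSTOPPED, WIFCONTINUED) and does not depend on the internal _WSTATUS encoding mentioned above. The helper name observe_continue() is made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, stop it, continue it, and check that the wait call
 * reports the continue event. Returns 1 if WIFCONTINUED was seen,
 * 0 if it was not, and -1 on any setup error.
 */
static int
observe_continue(void)
{
	int status, continued;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child idles until killed */
	}

	/* Stop the child and collect the stop event first. */
	if (kill(pid, SIGSTOP) == -1 ||
	    waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status))
		goto fail;

	/* Resume it; WCONTINUED asks the wait call to report that. */
	if (kill(pid, SIGCONT) == -1 ||
	    waitpid(pid, &status, WCONTINUED) != pid)
		goto fail;
	continued = WIFCONTINUED(status) ? 1 : 0;

	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);	/* reap the child */
	return continued;
fail:
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return -1;
}
```

Without WCONTINUED in the options, the second waitpid() would not return for the SIGCONT and would block until the child changed state in some other way.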


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
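
The clipping rule from the "nice bug" paragraph can be written out as a small sketch. NICE_WEIGHT = 2 comes from the text; the values used for PRIO_MAX (20) and PPQ (4) are assumptions made for illustration and are suffixed _SKETCH to make that obvious. estcpu_clip() is a hypothetical helper, not the kernel macro.

```c
#include <assert.h>

#define NICE_WEIGHT	2	/* "nice weight of two", per the text */
#define PRIO_MAX_SKETCH	20	/* assumed: largest nice value */
#define PPQ_SKETCH	4	/* assumed: priorities per run queue */

/* Upper bound from the text: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define ESTCPU_LIMIT	(NICE_WEIGHT * PRIO_MAX_SKETCH - PPQ_SKETCH)

static_assert(ESTCPU_LIMIT == 36, "limit under the assumed constants");

/*
 * Clip the CPU usage estimate so that a compute-bound process cannot
 * penalize itself onto the same queue as a nice +20 process.
 */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu > ESTCPU_LIMIT ? ESTCPU_LIMIT : estcpu;
}
```

Under these assumed constants the estimate saturates at 36, one queue's worth (PPQ) below the 40 points of penalty that nice +20 contributes.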


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
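
Seen from userland, the effect of SA_NOCLDWAIT is that a terminated child leaves no zombie for its parent to collect. The sketch below shows the observable behaviour through sigaction(2) and waitpid(2); it does not show the in-kernel reparenting to PID 1. The helper name nocldwait_errno() is made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set SA_NOCLDWAIT for SIGCHLD, fork a child that exits at once, and
 * try to wait for it. Returns the errno left by waitpid(), 0 if
 * waitpid() unexpectedly returned a status, or -1 on setup failure.
 * The previous SIGCHLD disposition is restored before returning.
 */
static int
nocldwait_errno(void)
{
	struct sigaction sa, old;
	int result, status;
	pid_t pid;

	sa.sa_handler = SIG_DFL;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_NOCLDWAIT;
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return -1;

	pid = fork();
	if (pid == -1) {
		result = -1;
	} else if (pid == 0) {
		_exit(0);		/* child: terminate immediately */
	} else if (waitpid(pid, &status, 0) == -1) {
		result = errno;		/* expected: no child left to wait for */
	} else {
		result = 0;
	}

	sigaction(SIGCHLD, &old, NULL);
	return result;
}
```

On systems that implement the flag, the expected result is ECHILD: waitpid() blocks until the child is gone and then fails because there is nothing to reap.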


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.170 04-Oct-2018 kettenis

Call unveil_destroy() from exit1() instead of from the reaper. Fixes a
race between the reaper and unveil_removevnode() that would trigger a
KASSERT. At least as far as I can tell. Pointed out by semarie@

ok beck@, deraadt@


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
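
The first half of this fix can be checked from userland with a sketch like the one below: pre-load the status word with a sentinel, poll with WNOHANG while the child is still running, and confirm the sentinel survived. The helper name and the sentinel value are arbitrary choices for this example, and other systems are only assumed to behave the same way as described here.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* arbitrary marker value */

/*
 * Poll a still-running child with WNOHANG. Returns 1 if waitpid()
 * returned 0 and left the status word untouched, 0 if the status
 * word was modified, and -1 on any setup error.
 */
static int
nohang_leaves_status_alone(void)
{
	int status = STATUS_SENTINEL;
	pid_t pid, r;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child never exits on its own */
	}

	r = waitpid(pid, &status, WNOHANG);	/* nothing to report yet */

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);			/* reap the child */

	if (r != 0)
		return -1;
	return status == STATUS_SENTINEL;
}
```

A caller that ignores the return value and inspects the status anyway would otherwise be reading whatever the kernel happened to write there.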


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
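
A cache of recently exited pids like the one described above could be sketched as a small ring buffer. This is a user-space illustration only, not the code from this commit: the names `recent_pids`, `pid_remember()`, `pid_recently_exited()` and the cache size are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define RECENT_PIDS 8			/* hypothetical cache size */

/* Zero marks an empty slot; pid 0 is never handed out by an allocator. */
static pid_t recent_pids[RECENT_PIDS];
static size_t recent_next;

/* Record a pid that has just exited, overwriting the oldest entry. */
void
pid_remember(pid_t pid)
{
	recent_pids[recent_next] = pid;
	recent_next = (recent_next + 1) % RECENT_PIDS;
}

/* Return true if the pid exited recently and should not be recycled yet. */
bool
pid_recently_exited(pid_t pid)
{
	for (size_t i = 0; i < RECENT_PIDS; i++)
		if (recent_pids[i] == pid)
			return true;
	return false;
}
```

A pid allocator would call `pid_recently_exited()` on each candidate and pick another value while it returns true.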


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
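
The "erase reachable pointers first, then free" ordering can be sketched in user space as follows. The types and the `ref_release()` function are stand-ins invented for this example (the real vrele(9) operates on a struct vnode and may sleep); only the ordering is the point.

```c
#include <stddef.h>

struct vnode_ref { int refs; };			/* stand-in for struct vnode */
struct owner { struct vnode_ref *textvp; };	/* stand-in for the owning process */

static int releases;				/* counts calls, for illustration */

/* Stand-in for vrele(9): drops one reference. The real one may sleep. */
void
ref_release(struct vnode_ref *vp)
{
	vp->refs--;
	releases++;
}

/*
 * Unhook the pointer before releasing it, so that nobody who runs while
 * the release sleeps can find (and release again) the same reference.
 */
void
owner_drop_text(struct owner *o)
{
	struct vnode_ref *vp = o->textvp;

	o->textvp = NULL;
	if (vp != NULL)
		ref_release(vp);
}
```

Calling `owner_drop_text()` a second time is harmless in this sketch because the pointer is already gone by the time the release happens.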


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
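
The reference-count idea can be sketched as below. This is an illustrative user-space model, not the kernel code: `struct process_sketch`, `process_ref()` and `process_unref()` are hypothetical names, and the plain `int` counter assumes all callers are serialized by one lock.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared per-process state. */
struct process_sketch {
	int	refcnt;		/* one reference per thread still using it */
	bool	freed;		/* set where the real code would free the struct */
};

/* Take a reference, e.g. when a thread is added to the process. */
void
process_ref(struct process_sketch *pr)
{
	pr->refcnt++;
}

/*
 * Drop a reference, e.g. when a thread is reaped. Only the last caller
 * tears the structure down, regardless of which thread that is.
 * Returns true if this call released the structure.
 */
bool
process_unref(struct process_sketch *pr)
{
	if (--pr->refcnt > 0)
		return false;
	pr->freed = true;
	return true;
}
```

With two threads holding references, the main thread can reach the reaper first without the structure disappearing from under the other thread.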


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
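
As a rough user-space analogue of the change above, the same set/clear pattern can be written with C11 atomics. This swaps in `<stdatomic.h>` for the kernel's atomic_{set,clear}bits_int, and the flag values and function names are made up for the example.

```c
#include <stdatomic.h>

#define SK_FLAG_A	0x01u	/* hypothetical flag bits */
#define SK_FLAG_B	0x02u

/* Atomically OR bits into the flag word (read-modify-write in one step). */
void
sk_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits from the flag word. */
void
sk_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain `flags |= bits` compiles to a separate load and store on many machines, so an interrupt that updates the same word in between can have its change overwritten; the atomic forms avoid that.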


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
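
The encoding trick can be checked with a small stand-alone sketch. The `SK_` macros below are assumptions modelled on the description in this message (the numeric values are chosen to satisfy the stated properties and are not copied from sys/wait.h); `sk_classify()` is a hypothetical helper.

```c
/* Assumed values: the low 7 bits of SK_WCONTINUED equal SK_WSTOPPED. */
#define SK_WSTOPPED		0177
#define SK_WCONTINUED		0177777

#define SK_WSTATUS(x)		((x) & 0177)
/* Looks at 8 bits, so a "continued" status does not match. */
#define SK_WIFSTOPPED(x)	(((x) & 0xff) == SK_WSTOPPED)
#define SK_WIFCONTINUED(x)	(((x) & SK_WCONTINUED) == SK_WCONTINUED)
/* Unchanged: excludes anything whose low 7 bits equal SK_WSTOPPED. */
#define SK_WIFSIGNALED(x)	(SK_WSTATUS(x) != SK_WSTOPPED && SK_WSTATUS(x) != 0)
#define SK_WIFEXITED(x)		(SK_WSTATUS(x) == 0)

/* Map a wait status to a short description. */
const char *
sk_classify(int status)
{
	if (SK_WIFCONTINUED(status))
		return "continued";
	if (SK_WIFSTOPPED(status))
		return "stopped";
	if (SK_WIFSIGNALED(status))
		return "signaled";
	return "exited";
}
```

Because the low seven bits of the continued value match the stopped value, the signaled test rejects it without modification; only the stopped test needs the wider mask.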


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
Noone will wait for it (because noone is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
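
The "nice bug" paragraph above can be illustrated with a small sketch. The constants are assumptions for the example (a nice weight of two as stated in the message, PRIO_MAX of 20, and 4 priorities per queue); the function names are hypothetical.

```c
#include <limits.h>

#define SK_NICE_WEIGHT	2	/* from the message: "nice weight of two" */
#define SK_PRIO_MAX	20	/* assumed maximum nice value */
#define SK_PPQ		4	/* assumed priorities per run queue */
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Old behaviour: an unsigned char can never exceed UCHAR_MAX, so no-op. */
unsigned char
sk_clip_old(unsigned char estcpu)
{
	unsigned int v = estcpu;

	return (unsigned char)(v > UCHAR_MAX ? UCHAR_MAX : v);
}

/* New behaviour: keep the estimate at or below the limit from the message. */
unsigned char
sk_clip_new(unsigned int estcpu)
{
	return (unsigned char)(estcpu > SK_ESTCPU_LIMIT ?
	    SK_ESTCPU_LIMIT : estcpu);
}
```

Under these assumed constants the limit works out to 2 * 20 - 4 = 36, so a compute-bound process stays at least one queue's worth of priority away from where a nice +20 process starts.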


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.169 25-Aug-2018 anton

Change kcov semantics, kernel code coverage tracing is now enabled on a per
thread basis instead of process. The decision to enable on process made
development easier initially but could lead to non-deterministic results for
processes with more than one thread. This behavior matches the implementation
found on both Linux and FreeBSD.

With help and ok mpi@ visa@


# 1.168 21-Aug-2018 anton

Rework kcov kernel config. Instead of treating kcov as both an option and a
pseudo-device, get rid of the option. Enabling kcov now requires the following
line to be added to the kernel config:

pseudo-device kcov 1

This is how pseudo devices are enabled in general. A side-effect of this change
is that dev/kcov.c will no longer be compiled by default.

Prodded by deraadt@; ok mpi@ visa@


# 1.167 19-Aug-2018 anton

Add kcov(4), a kernel code coverage tracing driver. It's used in conjunction
with the syzkaller kernel fuzzer. So far, 8 distinct panics have been found and
fixed. This effort will continue.

kcov is limited to architectures using Clang as their default compiler and is
not enabled by default.

With help from mpi@, thanks!

ok kettenis@ mpi@ visa@


# 1.166 13-Aug-2018 visa

Simplify the startup of the cleaner, reaper and update threads by
passing the main function directly to kthread_create(9). The start_*
functions are mere stepping stones nowadays and can be pruned.
They used to contain more logic in the pre-kthread era.

While here, set `cleanerproc' and `syncerproc' during the thread
creation rather than expect the threads to set the proc pointer.
Also, rename `sched_sync' to `syncer_thread' to reduce confusion
with the scheduler-related functions.

OK kettenis@, deraadt@, mpi@


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveil's across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
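
The WNOHANG behaviour described in this entry can be observed from userland with a small check. This is only a sketch, assuming a POSIX-like system that provides wait4(); the helper name wnohang_leaves_status() and the sentinel value are hypothetical and are not taken from kern_exit.c. It checks only the status word and makes no claim about the contents of the rusage buffer.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

#define STATUS_SENTINEL	0x5a5a	/* hypothetical marker value */

/*
 * Poll a still-running child with WNOHANG.
 * Returns 1 if wait4() returned 0 and left status untouched,
 * 0 if either condition failed, -1 if fork() or wait4() failed.
 */
int
wnohang_leaves_status(void)
{
	struct rusage ru;
	pid_t pid, ret;
	int status = STATUS_SENTINEL;
	int ok;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child never exits on its own */
	}

	/* The child is alive, so WNOHANG should make this return 0. */
	ret = wait4(pid, &status, WNOHANG, &ru);
	ok = (ret == 0 && status == STATUS_SENTINEL);

	/* Clean up: terminate and reap the child. */
	kill(pid, SIGKILL);
	(void)wait4(pid, &status, 0, &ru);

	return ret == -1 ? -1 : ok;
}
```

A caller would treat a return value of 1 as confirmation that the status word was left alone when no child had changed state.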


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
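
The reason for atomic bit operations on a flag word shared with interrupt context can be illustrated in portable C11. This is only a userland analogue built on <stdatomic.h>, not the kernel's atomic_{set,clear}bits_int implementation; the names demo_setbits(), demo_clearbits(), and the DEMO_FLAG_* bits are hypothetical.

```c
#include <stdatomic.h>

#define DEMO_FLAG_A	0x1u	/* hypothetical flag bits */
#define DEMO_FLAG_B	0x4u

/* Set bits in a shared flag word as one indivisible read-modify-write. */
static inline void
demo_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Clear bits in a shared flag word as one indivisible read-modify-write. */
static inline void
demo_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}
```

A plain `flags |= bits` compiles to separate load, modify, and store steps, so an update made by another context between the load and the store can be lost; the atomic forms avoid that.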


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok noone else dared to ok this, but noone screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
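
The clipping rule from the "nice bug" paragraph above can be sketched as follows. This is only an illustration: the constant values (NICE_WEIGHT 2, PRIO_MAX 20, PPQ 4) and the helper names estcpu_clip() and user_pri_offset() are assumptions made for the example, not code from this revision.

```c
/* Illustrative constants; the values are assumptions, not taken from this log. */
#define NICE_WEIGHT 2	/* priority units per nice step */
#define PRIO_MAX    20	/* highest nice value */
#define PPQ         4	/* priorities per run queue (assumed) */
#define ESTCPU_MAX  (NICE_WEIGHT * PRIO_MAX - PPQ)

/* Clip the decayed CPU estimate to the bound named in the commit message. */
static unsigned int
estcpu_clip(unsigned int estcpu)
{
	return estcpu > ESTCPU_MAX ? ESTCPU_MAX : estcpu;
}

/* Hypothetical priority offset: unscaled CPU term plus weighted nice term. */
static unsigned int
user_pri_offset(unsigned int estcpu, int nice)
{
	return estcpu_clip(estcpu) + NICE_WEIGHT * (unsigned int)nice;
}
```

With these assumed values a compute-bound nice +0 process tops out at offset 36 (queue 9), one queue below the lowest offset of a nice +20 process (40, queue 10), which is the separation the message asks for.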


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.165 13-Jul-2018 beck

Unveiling unveil(2).
This brings unveil into the tree, disabled by default - Currently
this will return EPERM on all attempts to use it until we are
fully certain it is ready for people to start using, but this
now allows for others to do more tweaking and experimentation.

Still needs to send the unveils across forks and execs before
fully enabling.

Many thanks to robert@ and deraadt@ for extensive testing.
ok deraadt@


Revision tags: OPENBSD_6_3_BASE
# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes that include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
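
The WNOHANG behaviour described above can be checked from userland with a sketch like the following. It assumes a system that provides wait4(2) and follows the behaviour this commit describes; the helper name wnohang_leaves_status_alone() and the sentinel value are made up for the example.

```c
#define _DEFAULT_SOURCE	/* for wait4() on glibc; harmless elsewhere */
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Returns 1 if wait4(WNOHANG) on a still-running child returned 0 and
 * left the status word untouched, 0 otherwise, -1 if fork() failed.
 */
static int
wnohang_leaves_status_alone(void)
{
	const int sentinel = 0x5a5a;
	int status = sentinel;
	struct rusage ru;
	pid_t pid, ret;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		pause();	/* child: stay alive until killed */
		_exit(0);
	}

	/* Child has not terminated, so this should return 0 immediately. */
	ret = wait4(pid, &status, WNOHANG, &ru);

	/* Clean up: terminate and reap the child. */
	kill(pid, SIGKILL);
	(void)waitpid(pid, NULL, 0);

	return ret == 0 && status == sentinel;
}
```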


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt
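
A cache of recently exited pids could look something like the ring buffer below. This is a guess at the general shape only: the size OLDPIDS and the names pid_retire() and pid_recently_used() are hypothetical and not taken from this revision.

```c
#include <stdbool.h>
#include <sys/types.h>

#define OLDPIDS 4	/* assumed cache size, for illustration only */

/* Zero-initialized, so pid 0 reads as "recently used"; it is never handed out here. */
static pid_t oldpids[OLDPIDS];
static int oldpid_next;

/* Remember a pid that just exited, overwriting the oldest entry. */
static void
pid_retire(pid_t pid)
{
	oldpids[oldpid_next] = pid;
	oldpid_next = (oldpid_next + 1) % OLDPIDS;
}

/* An allocator would skip candidate pids for which this returns true. */
static bool
pid_recently_used(pid_t pid)
{
	for (int i = 0; i < OLDPIDS; i++)
		if (oldpids[i] == pid)
			return true;
	return false;
}
```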


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
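
The "erase reachable pointers first, then free" ordering can be sketched in plain C as below. The types and names (struct obj, struct owner, obj_release(), owner_drop()) are stand-ins invented for the example; in the kernel the release step is vrele(), which may sleep.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a reference-counted object such as a vnode. */
struct obj {
	int refs;
};

/* Stand-in for a structure holding a published pointer to the object. */
struct owner {
	struct obj *textvp;
};

/* Drop one reference; frees on the last one. The kernel analogue may sleep. */
static void
obj_release(struct obj *o)
{
	if (--o->refs == 0)
		free(o);
}

/* Unpublish the pointer before releasing, so nobody can find it mid-release. */
static void
owner_drop(struct owner *ow)
{
	struct obj *o = ow->textvp;

	ow->textvp = NULL;
	if (o != NULL)
		obj_release(o);
}
```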


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
in PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
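
As an illustration of why plain read-modify-write on a flag word is replaced by atomic operations, here is a user-space sketch using C11 atomics. The sk_ helpers are hypothetical analogues of atomic_{set,clear}bits_int, not the kernel's implementation, and the flag values are made up.

```c
#include <stdatomic.h>

#define SK_FLAG_A	0x01u	/* hypothetical flag bits */
#define SK_FLAG_B	0x02u

/* Atomically OR bits into the flag word. */
void
sk_setbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_or(flags, bits);
}

/* Atomically clear bits in the flag word. */
void
sk_clearbits(atomic_uint *flags, unsigned int bits)
{
	atomic_fetch_and(flags, ~bits);
}

/* Set both flags, clear one, and return what is left. */
unsigned int
sk_flags_demo(void)
{
	atomic_uint flags;

	atomic_init(&flags, 0);
	sk_setbits(&flags, SK_FLAG_A | SK_FLAG_B);
	sk_clearbits(&flags, SK_FLAG_A);
	return atomic_load(&flags);
}
```

Each helper is a single indivisible update, so a concurrent update from interrupt context cannot be lost between a separate load and store.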


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
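
A small user-space sketch of how a caller might use WCONTINUED and WIFCONTINUED as specified by 1003.1-2001. The function name is hypothetical and error handling is reduced to a pass/fail result.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stop a child, continue it, and check that the continue event is
 * reported as "continued" and not as "stopped".
 * Returns 1 on success, 0 on unexpected behavior, -1 if fork fails.
 */
int
wcontinued_demo(void)
{
	int status, ok = 0;
	pid_t pid = fork();

	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child idles until killed */
	}

	kill(pid, SIGSTOP);
	if (waitpid(pid, &status, WUNTRACED) == pid && WIFSTOPPED(status)) {
		kill(pid, SIGCONT);
		if (waitpid(pid, &status, WCONTINUED) == pid &&
		    WIFCONTINUED(status) && !WIFSTOPPED(status))
			ok = 1;
	}

	kill(pid, SIGKILL);		/* clean up the child */
	waitpid(pid, &status, 0);
	return ok;
}
```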


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE, free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
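
The clipping rule from the "nice bug" paragraph can be sketched in C. This is only an illustration under stated assumptions: the SK_/sk_ names are hypothetical, PPQ = 4 is assumed (the message does not give its value), and the unscaled sum of p_estcpu and the nice weight is inferred from "use p_estcpu without scaling".

```c
#define SK_NICE_WEIGHT	2	/* "nice weight of two" from the message */
#define SK_PRIO_MAX	20	/* largest nice value */
#define SK_PPQ		4	/* priorities per run queue (assumed) */

/* Upper bound on the decayed CPU estimate: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define SK_ESTCPU_LIMIT	(SK_NICE_WEIGHT * SK_PRIO_MAX - SK_PPQ)

/* Clip the CPU estimate so it cannot exceed the limit. */
unsigned int
sk_estcpu_clip(unsigned int estcpu)
{
	return estcpu < SK_ESTCPU_LIMIT ? estcpu : SK_ESTCPU_LIMIT;
}

/* Priority offset above the user base priority (larger is worse). */
unsigned int
sk_pri_offset(unsigned int estcpu, unsigned int nice)
{
	return sk_estcpu_clip(estcpu) + SK_NICE_WEIGHT * nice;
}
```

With these assumed values the limit is 36, so a compute-bound process at nice 0 stays one queue ahead of an idle process at nice +20, whose offset is 40.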


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.
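
From user space, the visible effect of SA_NOCLDWAIT is that the parent never gets to collect its children. The sketch below shows that effect only; it says nothing about how the kernel reparents the child, and the function name is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set SA_NOCLDWAIT on SIGCHLD, let a child exit, and try to wait for it.
 * Returns 1 if waitpid() fails with ECHILD, 0 otherwise, -1 on setup error.
 */
int
nocldwait_demo(void)
{
	struct sigaction sa, old;
	int status, saved_errno;
	pid_t pid, r;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sa.sa_flags = SA_NOCLDWAIT;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGCHLD, &sa, &old) == -1)
		return -1;

	pid = fork();
	if (pid == -1) {
		sigaction(SIGCHLD, &old, NULL);
		return -1;
	}
	if (pid == 0)
		_exit(0);		/* child exits immediately */

	/* Blocks until the child is gone, then reports no child to wait for. */
	r = waitpid(pid, &status, 0);
	saved_errno = errno;

	sigaction(SIGCHLD, &old, NULL);	/* restore previous disposition */
	return r == -1 && saved_errno == ECHILD;
}
```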


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.164 10-Feb-2018 mpi

Move cleanup job control bits to their own function.

Part of the larger 'proctreelk' diff from guenther@

No functional change, ok benno@, tedu@


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
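
The WNOHANG behavior described above can be observed from user space with a sentinel value. This is a sketch, not a regression test from the tree; the function name and the sentinel are made up.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SK_SENTINEL	0x5a5a	/* arbitrary marker value */

/*
 * Poll a still-running child with WNOHANG and check that the status
 * word is left alone when the call returns 0.
 * Returns 1 if status was untouched, 0 otherwise, -1 if fork fails.
 */
int
wnohang_demo(void)
{
	int status = SK_SENTINEL;
	int untouched;
	pid_t pid, r;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child stays alive until killed */
	}

	r = waitpid(pid, &status, WNOHANG);
	untouched = (r == 0 && status == SK_SENTINEL);

	kill(pid, SIGKILL);		/* clean up the child */
	waitpid(pid, &status, 0);
	return untouched;
}
```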


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminate struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world and
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
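
The "erase reachable pointers first, then free" ordering can be sketched as follows. The sk_ types and functions are hypothetical stand-ins, and sk_vrele() merely decrements a counter where the real vrele() may sleep.

```c
#include <stddef.h>

struct sk_vnode {
	int usecount;			/* outstanding references */
};

struct sk_process {
	struct sk_vnode *textvp;	/* pointer other code can reach */
};

/* Stand-in for vrele(); the real function may sleep before returning. */
void
sk_vrele(struct sk_vnode *vp)
{
	vp->usecount--;
}

/*
 * Unhook the vnode from the process before releasing it, so nobody can
 * find a half-released vnode through pr->textvp if the release sleeps.
 */
void
sk_release_text(struct sk_process *pr)
{
	struct sk_vnode *vp = pr->textvp;

	pr->textvp = NULL;		/* erase the reachable pointer first */
	if (vp != NULL)
		sk_vrele(vp);		/* then drop the reference */
}

/* Returns 1 if the pointer is cleared and the reference dropped. */
int
sk_release_demo(void)
{
	struct sk_vnode vn = { 1 };
	struct sk_process pr = { &vn };

	sk_release_text(&pr);
	return pr.textvp == NULL && vn.usecount == 0;
}
```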


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and putting
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
'abort' argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@
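
The reference count described above can be pictured with a small userland
sketch. This is not the kernel code: struct xprocess, the ps_refcnt field name
and the x_* helpers are hypothetical names chosen for illustration, and locking
is left out entirely. It only shows why the structure survives until the last
thread lets go, even when the main thread is released first.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct process; not the kernel definition. */
struct xprocess {
	int ps_refcnt;		/* one reference per attached thread */
};

static int x_freed;		/* counts frees, for demonstration only */

static struct xprocess *
x_process_new(void)
{
	struct xprocess *pr = calloc(1, sizeof(*pr));

	if (pr != NULL)
		pr->ps_refcnt = 1;	/* the main thread's reference */
	return pr;
}

/* A new thread takes its own reference. */
static void
x_process_hold(struct xprocess *pr)
{
	pr->ps_refcnt++;
}

/* Drop one reference; only the last one frees the structure. */
static void
x_process_rele(struct xprocess *pr)
{
	assert(pr->ps_refcnt > 0);
	if (--pr->ps_refcnt == 0) {
		free(pr);
		x_freed++;
	}
}

/* Returns 1 if the structure outlived the main thread's release. */
static int
x_demo(void)
{
	struct xprocess *pr;
	int freed_early;

	x_freed = 0;
	pr = x_process_new();
	if (pr == NULL)
		return -1;
	x_process_hold(pr);	/* a second thread attaches */
	x_process_rele(pr);	/* main thread is released first */
	freed_early = x_freed;	/* still 0: other thread holds a ref */
	x_process_rele(pr);	/* last thread is released */
	return freed_early == 0 && x_freed == 1;
}
```

Calling x_demo() walks through the ordering the commit worries about: the
main thread's reference goes away first, and the free is deferred until the
remaining thread drops its own.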


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok
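
As a rough illustration of the TAILQ-based sleep queue mentioned above, the
following userland sketch walks a queue once and unlinks every entry sleeping
on a given channel. struct xproc, the field names and x_wakeup() are
hypothetical and not taken from the kernel; the point is only that TAILQ_REMOVE
unlinks an element in constant time, so one pass over the queue is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, trimmed-down proc with a sleep channel. */
struct xproc {
	TAILQ_ENTRY(xproc) p_runq;	/* queue linkage */
	const void *p_wchan;		/* what this proc sleeps on */
};
TAILQ_HEAD(xslpque, xproc);

/* Remove every proc sleeping on chan; return how many were woken. */
static int
x_wakeup(struct xslpque *q, const void *chan)
{
	struct xproc *p, *next;
	int n = 0;

	for (p = TAILQ_FIRST(q); p != NULL; p = next) {
		next = TAILQ_NEXT(p, p_runq);	/* fetch before unlinking */
		if (p->p_wchan == chan) {
			TAILQ_REMOVE(q, p, p_runq);	/* O(1) unlink */
			p->p_wchan = NULL;
			n++;
		}
	}
	return n;
}

/* Returns 1 if exactly the two sleepers on chan1 were removed. */
static int
x_demo(void)
{
	struct xslpque q;
	struct xproc a, b, c;
	int chan1 = 0, chan2 = 0;	/* only their addresses matter */
	int woken;

	TAILQ_INIT(&q);
	a.p_wchan = &chan1;
	b.p_wchan = &chan2;
	c.p_wchan = &chan1;
	TAILQ_INSERT_TAIL(&q, &a, p_runq);
	TAILQ_INSERT_TAIL(&q, &b, p_runq);
	TAILQ_INSERT_TAIL(&q, &c, p_runq);

	woken = x_wakeup(&q, &chan1);
	assert(woken <= 3);
	return woken == 2 && TAILQ_FIRST(&q) == &b &&
	    TAILQ_NEXT(&b, p_runq) == NULL;
}
```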


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok
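
For readers unfamiliar with the pattern, here is a minimal userland sketch of
atomic set/clear operations on a flag word, using C11 <stdatomic.h> as a
stand-in for the kernel's atomic_{set,clear}bits_int. The flag values, struct
xproc and the x_* helpers are hypothetical. A plain "p_flag |= bit" is a
read-modify-write that can lose an update made from interrupt context between
the read and the write; the atomic forms avoid that.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bits; the real P_* values live in sys/proc.h. */
#define XP_FLAG_A	0x0001u
#define XP_FLAG_B	0x0002u

struct xproc {
	atomic_uint p_flag;	/* may be modified concurrently */
};

/* Userland stand-in for atomic_setbits_int(). */
static void
x_setbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_or(p, bits);
}

/* Userland stand-in for atomic_clearbits_int(). */
static void
x_clearbits(atomic_uint *p, unsigned int bits)
{
	atomic_fetch_and(p, ~bits);
}

/* Set two flags, clear one, and return the resulting flag word. */
static unsigned int
x_demo(void)
{
	struct xproc p;

	atomic_init(&p.p_flag, 0);
	x_setbits(&p.p_flag, XP_FLAG_A | XP_FLAG_B);
	x_clearbits(&p.p_flag, XP_FLAG_A);
	assert((atomic_load(&p.p_flag) & XP_FLAG_A) == 0);
	return atomic_load(&p.p_flag);
}
```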


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as good as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broke.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.
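
The clipping rule from the "nice bug" paragraph can be sketched as follows.
This is an illustration under stated assumptions, not the scheduler code: the
nice weight of two comes from the message above, while the values used here for
PRIO_MAX (20) and PPQ (4) and all X_/x_ names are assumptions made for the
example. With those values the limit works out to 2 * 20 - 4 = 36.

```c
#include <assert.h>

#define X_NICE_WEIGHT	2u	/* "nice weight of two", per the message */
#define X_PRIO_MAX	20u	/* assumed largest nice value */
#define X_PPQ		4u	/* assumed priorities per run queue */

/* Upper bound for p_estcpu: NICE_WEIGHT * PRIO_MAX - PPQ. */
#define X_ESTCPU_MAX	(X_NICE_WEIGHT * X_PRIO_MAX - X_PPQ)

/* Clip an estimated-cpu value to the bound described in the message. */
static unsigned int
x_clip_estcpu(unsigned int estcpu)
{
	unsigned int clipped;

	clipped = estcpu < X_ESTCPU_MAX ? estcpu : X_ESTCPU_MAX;
	assert(clipped <= X_ESTCPU_MAX);
	return clipped;
}
```

By contrast, clipping an unsigned char against UCHAR_MAX can never change the
value, which is the no-op the message points out.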


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision


# 1.163 30-Dec-2017 guenther

Delete unnecessary <sys/file.h> includes

ok millert@ krw@


# 1.162 28-Nov-2017 guenther

deadproc_mutex is only taken _before_ kernel_lock; exclude it from
WITNESS checking as (our) witness code isn't smart enough to let that by.

ok visa@


Revision tags: OPENBSD_6_2_BASE
# 1.161 29-Aug-2017 deraadt

Remove old deactivated pledge path code. A replacement mechanism is
being brewed.
ok beck


# 1.160 20-Apr-2017 visa

Add a port of witness(4) lock validation tool from FreeBSD.

Go-ahead from kettenis@, guenther@, deraadt@


Revision tags: OPENBSD_6_1_BASE
# 1.159 08-Feb-2017 guenther

Delete the obsolete fork/exec/exit emulation hooks.

ok mpi@ dlg@


# 1.158 07-Nov-2016 guenther

Split PID from TID, giving processes a PID unrelated to the TID of their
initial thread

ok jsing@ kettenis@


Revision tags: OPENBSD_6_0_BASE
# 1.157 25-Apr-2016 tedu

boom goes the dynamite


# 1.156 29-Mar-2016 mpi

Use a macro to check if a thread has a sibling.

Note that without locking a thread cannot claim that it is part
of a multi-threaded process using this macro.

Suggested by miod@, ok guenther@


# 1.155 06-Mar-2016 guenther

Localize some declarations to kern_exit.c: the last good reason to put
them in sys/proc.h has been removed with compat_linux

diff from Michal Mazurek (akfaew (at) jasminek.net)


Revision tags: OPENBSD_5_9_BASE
# 1.154 09-Oct-2015 deraadt

Rename tame() to pledge(). This fairly interface has evolved to be more
strict than anticipated. It allows a programmer to pledge/promise/covenant
that their program will operate within an easily defined subset of the
Unix environment, or it pays the price.


# 1.153 07-Oct-2015 deraadt

Add the tame "exec" request. This allows processes which request
"exec" to call execve(2), potentially fork(2) beforehand if they
asked for "proc". Calling execve is what "shells" (ksh, tmux, etc)
have as their primary purpose. But meantime, if such a shell has a
nasty bug, we want to mitigate the process from opening a socket or
calling 100+ other system calls. Unfortunately silver bullets are in
short supply, so if our goal is to stay in a POSIX-y environment, we
have to let shells call execve(). POSIX ate the world, so what choices do
we all have?
Warning for many: silver bullets are even more rare in other OS
ecosystems, so please accept this as a narrow lowering of the bar in a
very raised environment.
Committed from a machine running tame "proc exec" ksh, make, etc.


# 1.152 11-Sep-2015 guenther

Only include <sys/tame.h> in the .c files that need it

ok deraadt@ miod@


# 1.151 28-Aug-2015 deraadt

fairly simple sizes for free(); ok tedu


# 1.150 22-Aug-2015 deraadt

Move to tame(int flags, char *paths[]) API/ABI.

The pathlist is a whitelist of dirs and files; anything else returns ENOENT.
Recommendation is to use a narrowly defined list. Also add TAME_FATTR, which
permits explicit change operations against "struct stat" fields. Some
other TAME_ flags are refined slightly.

Not cranking libc now, since nothing committed in base uses this and the
timing is uncomfortable for others. Discussed with many; thanks for a
few bug fixes from semarie, doug, guenther.
ok guenther


Revision tags: OPENBSD_5_8_BASE
# 1.149 14-Mar-2015 jsg

Remove some includes include-what-you-use claims don't
have any direct symbols used. Tested for indirect use by compiling
amd64/i386/sparc64 kernels.

ok tedu@ deraadt@


Revision tags: OPENBSD_5_7_BASE
# 1.148 16-Dec-2014 tedu

primary change: move uvm_vnode out of vnode, keeping only a pointer.
objective: vnode.h doesn't include uvm_extern.h anymore.
followup changes: include uvm_extern.h or lock.h where necessary.
ok and help from deraadt


Revision tags: OPENBSD_5_6_BASE
# 1.147 12-Jul-2014 tedu

add a size argument to free. will be used soon, but for now default to 0.
after discussions with beck deraadt kettenis.


# 1.146 11-Jul-2014 guenther

It's init as a process that's special, not init's original thread.
Remember initprocess instead of initproc.

ok matthew@ blambert@


# 1.145 08-Jul-2014 deraadt

decouple struct uvmexp into a new file, so that uvm_extern.h and sysctl.h
don't need to be married.
ok guenther miod beck jsing kettenis


# 1.144 04-Jul-2014 guenther

Track whether a process is a zombie or not yet fully built via flags
PS_{ZOMBIE,EMBRYO} on the process instead of peeking into the process's
thread data. This eliminates the need for the thread-level SDEAD state.

Change kvm_getprocs() (both the sysctl() and kvm backends) to report the
"most active" scheduler state for the process's threads.

tweaks kettenis@
feedback and ok matthew@


# 1.143 11-Jun-2014 matthew

Fix wait4 to not modify status or rusage if we return 0 because of
WNOHANG, in accordance with POSIX. Additionally, if rusage is
requested but the waited-on process did not terminate, return zero
bytes instead of kernel stack garbage.

ok deraadt, millert
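
From userland, the WNOHANG behaviour described above could be checked roughly
like this: preload the status word with a sentinel, poll a child that is still
running, and confirm the sentinel survives when the call returns 0. The
function name and sentinel value are made up for the example, which uses
waitpid(2) rather than wait4(2) for portability.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/*
 * Returns 1 if waitpid(WNOHANG) returned 0 and left the status word
 * untouched while the child was still running, 0 if not, and -1 if
 * fork() failed.
 */
static int
x_wnohang_leaves_status(void)
{
	const int sentinel = 0x5a5a;	/* arbitrary marker value */
	int status = sentinel, ok;
	pid_t pid, r;

	pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		for (;;)
			pause();	/* child: idle until killed */
	}
	assert(pid > 0);	/* never signal pid 0 or -1 by accident */

	/* The child has not exited, so this should return 0 at once. */
	r = waitpid(pid, &status, WNOHANG);
	ok = (r == 0 && status == sentinel);

	/* Clean up: kill and reap the child. */
	kill(pid, SIGKILL);
	waitpid(pid, &status, 0);
	return ok;
}
```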


# 1.142 15-May-2014 guenther

knote_processexit() needs the thread to pass down to FRELE(), so pass it
the exiting thread instead of assuming that that's ps_mainproc.
Also, panic no matter which thread of init takes it down.

ok tedu@


# 1.141 15-May-2014 guenther

Move from struct proc to process the reference-count-holding pointers
to the process's vmspace and filedescs. struct proc continues to
keep copies of the pointers, copying them on fork, clearing them
on exit, and (for vmspace) refreshing on exec.
Also, make uvm_swapout_threads() thread aware, eliminating p_swtime
in kernel.

particular testing by ajacoutot@ and sebastia@


# 1.140 18-Apr-2014 guenther

Have each thread keep its own (counted!) reference to the process's ucreds
to avoid possible use-after-free references when swapping ids in threaded
processes. "Do I have the right creds?" checks are always made with the
thread's creds.

Inspired by FreeBSD and NetBSD
"right time" deraadt@


# 1.139 17-Apr-2014 guenther

Make sure the original thread is blocked until any other threads are
completely detached from the process before letting it exit, so that
sleeping in systrace_exit() doesn't reorder them and lead to a panic.

Panic reported by Fabian Raetz (fabian.raetz (at) gmail.com)
ok tedu@


# 1.138 30-Mar-2014 guenther

Eliminates struct pcred by moving the real and saved ugids into
struct ucred; struct process then directly links to the ucred

Based on a discussion at c2k10 or so before noting that FreeBSD and
NetBSD did this too.

ok matthew@


# 1.137 26-Mar-2014 guenther

Move p_emul and p_sigcode from proc to process.
Tweak the handling of ktrace EMUL when changing ktracing: only
generate one per process (not one per thread) and pass the correct
proc pointer down to the VFS layer. Permit generating of NAMI and
CSW records inside ktrace(2) itself.

ok deraadt@ millert@


# 1.136 22-Mar-2014 guenther

Move p_sigacts from struct proc to struct process.

testing help mpi@


Revision tags: OPENBSD_5_5_BASE
# 1.135 12-Feb-2014 guenther

Eliminate the exit sig handling, which was only invokable via the
Linux-compat clone() syscall when *not* using CLONE_THREAD. pirofti@
confirms Opera runs in compat without this, so out it goes; one less hair
to choke on in kern_exit.c

ok tedu@ pirofti@


# 1.134 09-Feb-2014 kettenis

Fix the lock order reversal problem in the code that stops traced
multi-threaded processes when they receive a signal:

1. Make the parent of the process (the tracer) wait for all threads to be
stopped (in wait4(2)) instead of the thread that received the signal.
This prevents us from calling tsleep(9) recursively.

2. Assume that we already hold the kernel lock if the P_SINTR flag is set
(just like we already assumed we were holding the scheduler lock) and
don't try to grab it again.

This should fix the panic that many people reported when debugging
multi-threaded programs with gdb(1).

ok & lots of help from guenther@


# 1.133 24-Jan-2014 guenther

exit1() needs to do a final aggregation of the thread's [us]ticks
and runtime to the process totals. Also, add ktracing of struct
rusage in wait4() and getrusage().

problem pointed out by tedu@
ok deraadt@


# 1.132 21-Jan-2014 guenther

Setting p->p_p to NULL when it's still running isn't safe for statclock().
It was just for cleanliness, so be a little dirty

ok krw@, who managed to convince his clock to fire in the gap


# 1.131 20-Jan-2014 guenther

Threads can't be zombies, only processes, so change zombproc to zombprocess,
make it a list of processes, and change P_NOZOMBIE and P_STOPPED from thread
flags to process flags. Add allprocess list for the code that just wants
to see processes.

ok tedu@


# 1.130 20-Jan-2014 guenther

Move p_textvp from struct proc to struct process so that the exit code
can be further simplified.

ok kettenis@


# 1.129 25-Oct-2013 guenther

Move the declarations for dogetrusage(), itimerround(), and dowait4()
to sys/*.h headers so that the compat/linux code can use them.
Change dowait4() to not copyout() the status value, but rather leave
that for its caller, as compat/linux has to translate it, with the
side benefit of simplifying the native code.

Originally written months ago as part of the time_t work; long
memory, prodding, and ok from pirofti@


# 1.128 08-Oct-2013 guenther

Fix delivery of SIGPROF and SIGVTALRM to threaded processes by having
hardclock() set a flag on the running thread and force AST processing,
and then have the thread signal itself from userret().

idea and flag names from FreeBSD
ok jsing@


# 1.127 14-Sep-2013 guenther

Snapshots for all archs have been built, so remove the T32 code


# 1.126 13-Aug-2013 guenther

Switch time_t, ino_t, clock_t, and struct kevent's ident and data
members to 64bit types. Assign new syscall numbers for (almost
all) the syscalls that involve the affected types, including anything
with time_t, timeval, itimerval, timespec, rusage, dirent, stat,
or kevent arguments. Add a d_off member to struct dirent and replace
getdirentries() with getdents(), thus immensely simplifying and
accelerating telldir/seekdir. Build perl with -DBIG_TIME.

Bump the major on every single base library: the compat bits included
here are only good enough to make the transition; the T32 compat
option will be burned as soon as we've reached the new world are
are happy with the snapshots for all architectures.

DANGER: ABI incompatibility. Updating to this kernel requires extra
work or you won't be able to login: install a snapshot instead.

Much assistance in fixing userland issues from deraadt@ and tedu@
and build assistance from todd@ and otto@


Revision tags: OPENBSD_5_4_BASE
# 1.125 05-Jun-2013 tedu

factor out pid allocation to functions. add a small cache of recently
exited pids that won't get recycled.
ok deraadt


# 1.124 01-Jun-2013 tedu

some small style changes that are distracting me from seeing a real bug


# 1.123 07-May-2013 guenther

Merge from FreeBSD, r191313


# 1.122 06-Apr-2013 tedu

rthreads are always enabled. remove the sysctl.
ok deraadt guenther kettenis matthew


# 1.121 30-Mar-2013 tedu

vrele() is a tricky beast. it can sleep if the refcount hits zero,
leaving us with a free type function that isn't atomic. deal with this
by erasing any reachable pointers to the vnode first, then free it.
ok deraadt guenther
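
The ordering described above (unhook the pointer, then release) can be
sketched in userland like this. Everything here is hypothetical: struct xvnode,
struct xprocess, the ps_textvp field name and x_vrele() are stand-ins, and the
stub only records what was reachable at the time of the release instead of
actually sleeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel structures. */
struct xvnode {
	int v_usecount;
};
struct xprocess {
	struct xvnode *ps_textvp;
};

static struct xprocess *x_observed;	/* process watched by x_vrele() */
static int x_was_unhooked;		/* what x_vrele() saw */

/*
 * Stand-in for vrele(). The real one may sleep, so anything still
 * pointing at the vnode could be followed by someone else meanwhile.
 */
static void
x_vrele(struct xvnode *vp)
{
	assert(vp->v_usecount > 0);
	x_was_unhooked = (x_observed != NULL &&
	    x_observed->ps_textvp == NULL);
	vp->v_usecount--;
}

/* Erase the reachable pointer first, then drop the reference. */
static void
x_release_text(struct xprocess *pr)
{
	struct xvnode *vp = pr->ps_textvp;

	pr->ps_textvp = NULL;
	if (vp != NULL)
		x_vrele(vp);
}

/* Returns 1 if the pointer was already cleared when x_vrele() ran. */
static int
x_demo(void)
{
	struct xvnode vn = { 1 };
	struct xprocess pr = { &vn };

	x_observed = &pr;
	x_was_unhooked = 0;
	x_release_text(&pr);
	return x_was_unhooked && vn.v_usecount == 0 &&
	    pr.ps_textvp == NULL;
}
```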


# 1.120 28-Mar-2013 deraadt

do not include machine/cpu.h from a .c file; it is the responsibility of
.h files to pull it in, if needed
ok tedu


Revision tags: OPENBSD_5_3_BASE
# 1.119 08-Sep-2012 kettenis

Plug a race where we're trying to kill a traced process while it is already
exiting. At that point ps_single may point to a proc that's already freed.
Since there is no point in killing a process that's already exiting, just
skip this step.

ok guenther@


# 1.118 02-Aug-2012 guenther

Apply profiling to all threads instead of just the thread that called
profil() by moving P_PROFIL from proc->p_flag to process->ps_flags with
matching adjustment in fork1() and exit1()

ok matthew@


Revision tags: OPENBSD_5_2_BASE
# 1.117 11-Jul-2012 guenther

exit1(EXIT_THREAD) needs to call single_thread_check() so that it
can be suspended and/or decrement pr->ps_singlecount if necessary.
With that added, the call in the other direction needs to use its own
flag (EXIT_THREAD_NOCHECK) to avoid looping.

problem diagnosed from a hang naddy@ hit; ok kettenis@


# 1.116 09-Jul-2012 guenther

The linux emulation exit hook needs to be able to sleep, so call it
before changing p_stat to SDEAD

ok pirofti@


# 1.115 14-Apr-2012 kettenis

If single threading is active, direct the SIGKILL signal we send to orphaned
traced processes to the active thread, otherwise we will deadlock resulting
in an unkillable stopped process.

ok guenther@


# 1.114 13-Apr-2012 kettenis

Backout a tiny part of the previous commit. Decrementing ps_singlecount in
exit1() is wrong, since single_thread_check() already decrements it and may
call exit1() after that. I can't reproduce the hang that this was supposed
to fix anyway.


# 1.113 13-Apr-2012 kettenis

First stab at making ptrace(2) usable for debugging multi-threaded programs.
It implements a full-stop model where all threads are stopped before handing
over control to the debugger. Events are reported as before through wait(2);
you will have to call ptrace(PT_GET_PROCESS_STATE, ...) to find out which
thread hit the event. Since this changes the size of struct ptrace_state,
you will have to recompile gdb.

ok guenther@


# 1.112 11-Apr-2012 kettenis

Move the P_WAITED flag from struct proc to struct process.

ok guenther@


# 1.111 10-Apr-2012 guenther

Make the KERN_NPROCS and KERN_MAXPROC sysctl()s and the RLIMIT_NPROC rlimit
count processes instead of threads. New sysctl()s KERN_NTHREADS and
KERN_MAXTHREAD count and limit threads. The nprocs and maxproc kernel
variables are replaced by nprocess, maxprocess, nthreads, and maxthread.

ok tedu@ mikeb@


# 1.110 06-Apr-2012 guenther

ruadd() does the summing of system and user times, so doing so again
results in bogus total times, as reported by numerous ports people.

ok miod@


# 1.109 23-Mar-2012 guenther

Make rusage totals, itimers, and profile settings per-process instead
of per-rthread. Handling of per-thread tick and runtime counters
inspired by how FreeBSD does it.

ok kettenis@


# 1.108 10-Mar-2012 guenther

Add PS_EXITING to better differentiate between the process exiting and
the main thread exiting. c.f. regress/sys/kern/main-thread-exited/


# 1.107 20-Feb-2012 guenther

First steps for making ptrace work with rthreads:
- move the P_TRACED and P_INEXEC flags, and p_oppid, p_ptmask, and
p_ptstat member from struct proc to struct process
- sort the PT_* requests into those that take a PID vs those that
can also take a TID
- stub in PT_GET_THREAD_FIRST and PT_GET_THREAD_NEXT

ok kettenis@


Revision tags: OPENBSD_5_1_BASE
# 1.106 17-Jan-2012 guenther

Reimplement mutexes, condvars, and rwlocks to eliminate bugs,
particularly the "consume the signal you just sent" hang, and put
the wait queues in userspace.

Do cancellation handling in pthread_cond_*wait(), pthread_join(),
and sem_wait().

Add __ prefix to thr{sleep,wakeup,exit,sigdivert}() syscalls; add
"abort" argument to thrsleep to close cancellation race; make
thr{sleep,wakeup} return errno values via *retval to avoid touching
userspace errno.


# 1.105 14-Dec-2011 guenther

Handle rthreads consistently in ktrace by moving the flags and vnode into
struct process; KTRFAC_ACTIVE becomes P_INKTR. Also, save the credentials
used to open the file in sys_ktrace() and use them for all writes to the
vnode.

much feedback and ok jsing@


# 1.104 11-Dec-2011 guenther

Suspend other rthreads before dumping core or execing; make them exit
when exec succeeds.

ok jsing@


Revision tags: OPENBSD_5_0_BASE
# 1.103 25-Jul-2011 tedu

sys_wait4 properly returns int. ok deraadt


# 1.102 06-Jul-2011 art

Clean up after P_BIGLOCK removal.
KERNEL_PROC_LOCK -> KERNEL_LOCK
KERNEL_PROC_UNLOCK -> KERNEL_UNLOCK

oga@ ok


# 1.101 05-Jul-2011 guenther

Recommit the reverted sigacts change now that the NFS use-after-free
problem has been tracked down. This fixes the sharing of the signal
handling state: shared bits go in sigacts, per-rthread bits go in
struct proc.

ok deraadt@


# 1.100 18-Apr-2011 guenther

Revert the sigacts diff: NFS can apparently retain pointers to processes
until they're zombies and then send them signals (for intr mounts). Until
that is untangled, the sigacts change is unsafe. sthen@ was the victim
for this one


# 1.99 15-Apr-2011 guenther

Correct the sharing of the signal handling state: stuff that should
be shared (p_sigignore, p_sigcatch, P_NOCLDSTOP, P_NOCLDWAIT) moves
to struct sigacts, while stuff that should be per rthread (ps_oldmask,
SAS_OLDMASK, ps_sigstk) moves to struct proc. Treat the coredumping
state bits (ps_sig, ps_code, ps_type, ps_sigval) as per-rthread
until our locking around coredumping is better.

Oh, and remove the old SunOS-compat ps_usertramp member.

"I like the sound of this" tedu@


# 1.98 03-Apr-2011 guenther

Move PPWAIT flag from struct proc to process, so that rthreads in
a vforked child behave correctly. Have the parent in a vfork()
wait on a (different) flag in *its* process instead of the child
to prevent a possible use-after-free. When ktracing the child
return from a fork, call it rfork if an rthread was created.

ok blambert@


Revision tags: OPENBSD_4_8_BASE OPENBSD_4_9_BASE
# 1.97 02-Aug-2010 guenther

Fix knote handling for exiting processes: when triggering a NOTE_EXIT
knote, remove it from the process's klist; after handling those,
remove and drop any remaining knotes from the process's klist. Ban
attaching knotes to processes that have started exiting or attaching
them via the pid of a thread other than the main thread.

ok tedu@, deraadt@


# 1.96 26-Jul-2010 guenther

Correct the links between threads, processes, pgrps, and sessions,
so that the process-level stuff is to/from struct process and not
struct proc. This fixes a bunch of problem cases in rthreads.
Based on earlier work by blambert and myself, but mostly written
at c2k10.

Tested by many: deraadt, sthen, krw, ray, and in snapshots


# 1.95 19-Jul-2010 guenther

Rollback the allproclk and fileheadlk addition. When grabbing an
rwlock, the thread will release biglock if it sleeps, which means that
atomicity from before the rw_enter() to after it is not guaranteed.
The change didn't address those, so pulling it until it does.

"go for it" tedu@


# 1.94 29-Jun-2010 guenther

We always copy struct pcred when creating a new process, so the reference
count was always one. That's pointless, so remove the member and the code.
ok tedu@


# 1.93 29-Jun-2010 tedu

Eliminate RTHREADS kernel option in favor of a sysctl. The actual status
(not done) hasn't changed, but now it's less work to test things.
ok art deraadt


# 1.92 26-May-2010 oga

Bad tedu, no cookie.

Don't set SDEAD on the process in exit1 until we have grabbed the
allproclk. allproclk is a rwlock and thus we may sleep to grab hold of
it. This is a bit of a bugger when we just set a flag that means we
panic if we sleep.

ok art@. turns Tom Murphy's fstat panic into a deadlock instead *sigh*,
this is being looked into.


# 1.91 18-May-2010 tedu

move knote list to struct process. ok guenther


# 1.90 24-Mar-2010 tedu

Add a rwlock around the filehead and allproc lists, mainly to protect
list walkers in sysctl that can block. As a reward, no more vslock.
With some feedback from art, guenther, phessler. ok guenther.


Revision tags: OPENBSD_4_7_BASE
# 1.89 20-Dec-2009 guenther

When using ptrace(), death of the traced process should always send
SIGCHLD to the tracer, even if the real parent requested an alternate
exit signal. So, delay clearing the P_TRACED flag from exit1() to
sys_wait4() so that we don't send the wrong signal from reaper().

Originally discussed with kurt months ago
"looks good" deraadt@


# 1.88 20-Dec-2009 guenther

svr4_sys_waitsys() was seemingly implemented by copying sys_wait4()
and hacking on it. Since then, some of the details of finishing a
wait have changed (p_exitsig handling), so factor out the common
bit into proc_finish_wait() and have both sys_wait4() and
svr4_sys_waitsys() call that to kill the divergence.

"looks good" deraadt@


# 1.87 27-Nov-2009 guenther

Change threxit() to take a pointer to a pid_t to zero out from the
kernel so that librthread can detect when a thread is completely
done with its stack without needing a kqueue. The dying thread moves
itself to a GC list, other threads scan the GC list on pthread_create()
and pthread_join() and free the stack and handle once the thread's
thread id is zeroed.

"get it in" deraadt@, tedu@, cheers by others


# 1.86 05-Oct-2009 deraadt

Don't drop the big lock at the end of exit1(), but move it into the middle of
sched_exit(). This means that cpu_exit() and whatever it does (for instance
calling free()), as well as the deadproc p_hash handling, are now locked as well.
This may have been one of the causes of the reaper panics, especially with
rthread patches... which were terminating a lot of threads very quickly onto
the deadproc p_hash list.
ok kurt kettenis miod


Revision tags: OPENBSD_4_6_BASE
# 1.85 24-Jun-2009 kurt

Remove extra psignal/wakeup in exit1() which can cause the parent to
receive SIGCHLD twice if scheduled before the reaper runs. diff by
guenther@ and myself. okay guenther@ deraadt@


# 1.84 03-Apr-2009 guenther

Fix SEM_UNDO handling for rthreads: use the struct process* instead
of the struct proc* as the identifier for SEM_UNDO tracking and only
call semexit() from the original thread, once the process as a whole
is exiting

ok tedu@


# 1.83 26-Mar-2009 oga

Remove cpu_wait(). Its original use was to be called from the reaper so
MD code would free resources that couldn't be freed until we were no
longer running in that processor. However, it is unused on all
architectures since mikeb@'s tss changes on x86 earlier in the year.

ok miod@


Revision tags: OPENBSD_4_5_BASE
# 1.82 16-Dec-2008 guenther

Move the functionality of psignal() to a new function ptsignal()
that takes an additional argument "type" that indicates whether the
signal is for the process, just a particular thread, or propagated
to a thread because it's not caught or blocked. psignal() becomes
a wrapper that does the first of those.

So that sys_kill() can tell apart signals for the process and signals
for the process's original thread, the tid of the original thread
is defined as its pid + THREAD_PID_OFFSET.

ok tedu@ art@ andreas@ kurt@ "better early than late" deraadt@


# 1.81 11-Dec-2008 deraadt

a little bit of paranoia


# 1.80 06-Nov-2008 deraadt

remove a really stupid comment. Duh, of course it can block


# 1.79 31-Oct-2008 deraadt

accidental commit ... backout


# 1.78 31-Oct-2008 deraadt

kern_sysctl.c


# 1.77 30-Oct-2008 deraadt

Use msleep() in the reaper to make it not lose events. Based on discussion
PR 5609, and revisited with dlg. Tested on all platforms.
ok miod


# 1.76 14-Oct-2008 guenther

Back-in; problems were apparently elsewhere.
Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process. Use the reference count to update the user process
count correctly when changing real uid.

"please re-commit before something else nasty comes in" deraadt@


# 1.75 10-Oct-2008 deraadt

backout; is causing some people difficulty


# 1.74 09-Oct-2008 guenther

Put a reference count in struct process to prevent use-after-free
if the main thread reaches the reaper ahead of some other thread
in the process.

ok art@ tedu@


Revision tags: OPENBSD_4_4_BASE
# 1.73 11-May-2008 tedu

set p_flag to 0 sooner, so we don't overwrite the thread flag. and correctly
free things when exiting a threaded proc. from philip guenther


Revision tags: OPENBSD_4_3_BASE
# 1.72 10-Oct-2007 art

Make context switching much more MI:
- Move the functionality of choosing a process from cpu_switch into
a much simpler function: cpu_switchto. Instead of having the locore
code walk the run queues, let the MI code choose the process we
want to run and only implement the context switching itself in MD
code.
- Let MD context switching run without worrying about spls or locks.
- Instead of having the idle loop implemented with special contexts
in MD code, implement one idle proc for each cpu. make the idle
loop MI with MD hooks.
- Change the proc lists from the old style vax queues to TAILQs.
- Change the sleep queue from vax queues to TAILQs. This makes
wakeup() go from O(n^2) to O(n)

there will be some MD fallout, but it will be fixed shortly.
There's also a few cleanups to be done after this.

deraadt@, kettenis@ ok


Revision tags: OPENBSD_4_2_BASE
# 1.71 12-Apr-2007 tedu

move p_limit and p_cred into struct process
leave macros behind for now to keep the commit small
ok art beck miod pedro


# 1.70 11-Apr-2007 tedu

remove proc from process thread list sooner in exit (notably, before waiting
for the list to become empty)
ok art


# 1.69 10-Apr-2007 tedu

undo


# 1.68 10-Apr-2007 tedu

remove process from thread list sooner in exit (notably, before waiting
for the list to become empty)


# 1.67 05-Apr-2007 tedu

jason crawford noticed that the rthreads diff didn't compile with rthreads!


# 1.66 04-Apr-2007 pedro

oursleves -> ourselves


# 1.65 04-Apr-2007 pedro

Don't remove the process from the threads queue in proc_zap() as that
currently already happens in exit1(), okay art@


# 1.64 03-Apr-2007 art

Start moving state that is shared among threads in a process into
a new struct. Instead of doing a huge rename and deal with the fallout
for weeks, like other projects that need no mention, we will slowly and
carefully move things out of struct proc into a new struct process.

- Create struct process and the infrastructure to create and remove them.
- Move threads in a process into struct process.

deraadt@, tedu@ ok


# 1.63 15-Mar-2007 art

Since p_flag is often manipulated in interrupts and without biglock
it's a good idea to use atomic.h operations on it. This mechanic
change updates all bit operations on p_flag to atomic_{set,clear}bits_int.

Only exception is that P_OWEUPC is set by MI code before calling
need_proftick and it's automatically cleared by ADDUPC. There's
no reason for MD handling of that flag since everyone handles it the
same way.

kettenis@ ok


Revision tags: OPENBSD_4_0_BASE OPENBSD_4_1_BASE
# 1.62 23-Jun-2006 mickey

consistently count context switches on exit; miod@ ok


# 1.61 15-Jun-2006 miod

Nothing sets P_FSTRACE anymore, so remove all what's left of it.


# 1.60 06-Apr-2006 mickey

release kernel lock _after_ the emulation exit hook is called to protect possible free()s; tedu@ deraadt@ ok


Revision tags: OPENBSD_3_9_BASE
# 1.59 20-Feb-2006 miod

Compile out more rthreads stuff unless option RTHREADS;
discussed with a few, ok tedu@


# 1.58 13-Dec-2005 tedu

make exiting actually work when a thread receives a signal.
previously, the child and parent would deadlock in the kernel
and be unable to exit. help with diagnosis from art@.


# 1.57 03-Dec-2005 tedu

kernel support for threaded processes (rthreads).
uses rfork(RFTHREAD) to create threads, which are presently processes
that are a little more tightly bound together. several new syscalls
added to facilitate a userland thread library.
all conditional on RTHREADS, currently disabled.
ok deraadt


# 1.56 28-Nov-2005 jsg

ansi/deregister.
'go for it' deraadt@


# 1.55 14-Sep-2005 kettenis

ptrace(2) following fork(2)
ok miod@


Revision tags: OPENBSD_3_7_BASE OPENBSD_3_8_BASE
# 1.54 26-Dec-2004 miod

Use list and queue macros where applicable to make the code easier to read;
no change in compiler assembly output.


Revision tags: OPENBSD_3_6_BASE
# 1.53 04-Aug-2004 art

hardclock detects if ITIMER_VIRTUAL and ITIMER_PROF have expired and
sends SIGVTALRM and SIGPROF to the process if they had. There is a big
problem with calling psignal from hardclock on MULTIPROCESSOR machines
though. It means we need to protect all signal state in the process
with a lock because hardclock doesn't obtain KERNEL_LOCK. Trying to
track down all the tentacles of this quickly becomes very messy. What
saves us at the moment is that SCHED_LOCK (which is used to protect
parts of the signal state, but not all) happens to be recursive and
forgives small and big errors. That's about to change.

So instead of trying to hunt down all the locking problems here, just
make hardclock not send signals. Instead hardclock schedules a timeout
that will send the signal later. There are many reasons why this works
just as well as the previous code, all explained in a comment written
in big, friendly letters in kern_clock.

miod@ ok; no one else dared to ok this, but no one screamed in agony either.


# 1.52 22-Jul-2004 art

SIMPLELOCK -> mutex for the lock around deadproc list.
Also move the whole deadproc infrastructure to kern_exit, it's only used
there.

miod@ ok


# 1.51 13-Jun-2004 niklas

debranch SMP, have fun


Revision tags: SMP_SYNC_A SMP_SYNC_B
# 1.50 27-May-2004 tedu

make acct(2) optional with ACCOUNTING
ok art@ deraadt@


Revision tags: OPENBSD_3_5_BASE
# 1.49 20-Mar-2004 tedu

one proc.h is sufficient


# 1.48 31-Dec-2003 millert

wait4(2) takes and returns pid_t, not int. OK deraadt@ and miod@


Revision tags: OPENBSD_3_4_BASE
# 1.47 03-Aug-2003 millert

Implement the WCONTINUED flag to the wait(2) family of syscalls and the
associated WIFCONTINUED macro as per 1003.1-2001. Adapted from FreeBSD.
A minor amount of trickiness is involved here. The value for WCONTINUED
is chosen in such a way that _WSTATUS(_WCONTINUED) == _WSTOPPED and the
WIFSTOPPED macro has been modified such that WIFSTOPPED(_WCONTINUED) !=
_WSTOPPED. This means we don't need to add an extra check to the
WIFSIGNALED and WIFSTOPPED macros. deraadt@ OK.
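
The trick described above can be checked with a small user-space sketch.
The X_-prefixed macros below are hypothetical copies written for
illustration; their values are assumptions modeled on this commit
message, and the real definitions in <sys/wait.h> may differ.

```c
/* Hypothetical status macros; the X_ prefix marks them as illustrative. */
#define X_WSTATUS(x)		((x) & 0177)
#define X_WSTOPPED		0177
#define X_WCONTINUED		0177777
#define X_WIFSTOPPED(x)		(((x) & 0xff) == X_WSTOPPED)
#define X_WIFCONTINUED(x)	(((x) & X_WCONTINUED) == X_WCONTINUED)
#define X_WIFSIGNALED(x) \
	(X_WSTATUS(x) != X_WSTOPPED && X_WSTATUS(x) != 0)

/* Classify a wait status using only the macros above. */
const char *
classify_status(int status)
{
	if (X_WIFCONTINUED(status))
		return "continued";
	if (X_WIFSTOPPED(status))
		return "stopped";
	if (X_WIFSIGNALED(status))
		return "signaled";
	return "exited";
}
```

With these assumed values, the low seven bits of X_WCONTINUED equal
X_WSTOPPED, so the signaled test rejects a continued status without an
extra check, while the stopped test compares eight bits and therefore
does not match it.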


# 1.46 21-Jul-2003 tedu

remove caddr_t casts. it's just silly to cast something when the function
takes a void *. convert uiomove to take a void * as well. ok deraadt@


# 1.45 21-Jun-2003 tedu

add exec/fork/exit hooks per process for compat emulations.
use them to correctly emulate linux brk.
update to TNF copyright in linux_exec.c.

from netbsd, mostly from a diff by Kurt Miller in pr3318.
this should fix java. no regressions in testing by kurt and sturm@.
be prepared for "proc size mismatch" -- recompile ps and friends.
ok deraadt@


# 1.44 02-Jun-2003 millert

Remove the advertising clause in the UCB license which Berkeley
rescinded 22 July 1999. Proofed by myself and Theo.


Revision tags: OPENBSD_3_3_BASE UBC_SYNC_A
# 1.43 29-Oct-2002 art

No need to free the address space in exit1(), we'll do that in the reaper.
That gives us the advantage of not being the active address space when
freeing the mappings in the pmap, which can lead to expensive TLB
flushes on some architectures.

plus some minor cleaning.


Revision tags: OPENBSD_3_2_BASE UBC_SYNC_B
# 1.42 16-May-2002 provos

systrace facility, used to enforce and generate policies for system calls
okay deraadt@


Revision tags: OPENBSD_3_1_BASE
# 1.41 14-Mar-2002 millert

First round of __P removal in sys


# 1.40 25-Jan-2002 art

poolify pcreds.


# 1.39 23-Jan-2002 art

Allocate rusage, pgrp, ucred and session with pool.


# 1.38 16-Jan-2002 miod

Don't include <sys/map.h> when you don't need what's in it.


Revision tags: UBC_BASE
# 1.37 12-Nov-2001 art

branches: 1.37.2;
Bring in more changes from NetBSD. Mostly pagedaemon improvements.


# 1.36 06-Nov-2001 miod

Replace inclusion of <vm/foo.h> with the correct <uvm/bar.h> when necessary.
(Look ma, I might have broken the tree)


Revision tags: OPENBSD_3_0_BASE
# 1.35 11-Sep-2001 miod

Don't include <vm/vm_kern.h> if you don't need foo_map.


# 1.34 25-Aug-2001 art

cleanup


# 1.33 27-Jun-2001 art

remove old vm


# 1.32 03-Jun-2001 angelos

WALTSIG is a valid option for sys_wait4().


# 1.31 16-May-2001 millert

kill COMPAT_{09,10,11} kernel options. We still need kern_info_09.c and kern_ipc_10.c for other compat modules.


Revision tags: OPENBSD_2_9_BASE
# 1.30 02-Apr-2001 niklas

On popular demand, the Linux-compatibility clone(2) implementation based
on NetBSD's code, as well as some faked Posix RT extensions by me. This makes
at least simple linuxthreads tests work.


# 1.29 23-Mar-2001 art

Use pool to allocate processes.


# 1.28 16-Nov-2000 provos

support kernel event queues, from FreeBSD by Jonathan Lemon,
okay art@, millert@


Revision tags: OPENBSD_2_8_BASE
# 1.27 06-Jun-2000 art

If the process is P_NOZOMBIE free its resources in the reaper.
No one will wait for it (because no one is allowed to wait for it).


# 1.26 05-Jun-2000 art

No need to use curproc here. We already know who we are.


# 1.25 05-Jun-2000 art

Changes to exit handling.

cpu_exit no longer frees the vmspace and u-area. This is now handled by a
separate kernel thread "reaper". This is to avoid sleeping locks in the
critical path of cpu_exit where we're not allowed to sleep.

From NetBSD


Revision tags: OPENBSD_2_7_BASE
# 1.24 05-May-2000 art

branches: 1.24.2;
Don't set filesize limit to infinity on exit.
This is only needed in accounting and has to be done carefully because
the limit structures are shared between processes.

Found by Denis A. Doroshenko, analysed by Hannah Schroeter.


# 1.23 20-Apr-2000 art

Add a function "ktrsettracevnode", that changes the ktrace vnode for a process
in a correct way. Use it in all places where the vnode was changed.
(most of the earlier code was incorrect and had races).


# 1.22 23-Mar-2000 art

Use the new timeout facilities for ITIMER_REAL.


# 1.21 21-Feb-2000 art

dead code and symbol pollution.


Revision tags: OPENBSD_2_6_BASE SMP_BASE kame_19991208
# 1.20 15-Aug-1999 pjanzen

branches: 1.20.4;
Adopt NetBSD fix for scheduler problems (nice was broken). From the NetBSD
commit messages:

Scheduler bug fixes and reorganization
* fix the ancient nice(1) bug, where nice +20 processes incorrectly
steal 10 - 20% of the CPU, (or even more depending on load average)
* provide a new schedclock() mechanism at a new clock at schedhz, so high
platform hz values don't cause nice +0 processes to look like they are
niced
* change the algorithm slightly, and reorganize the code a lot
* fix percent-CPU calculation bugs, and eliminate some no-op code

=== nice bug === Correctly divide the scheduler queues between niced and
compute-bound processes. The current nice weight of two (sort of, see
`algorithm change' below) neatly divides the USRPRI queues in half; this
should have been used to clip p_estcpu, instead of UCHAR_MAX. Besides
being the wrong amount, clipping an unsigned char to UCHAR_MAX is a no-op,
and it was done after decay_cpu() which can only _reduce_ the value. It
has to be kept <= NICE_WEIGHT * PRIO_MAX - PPQ or processes can
scheduler-penalize themselves onto the same queue as nice +20 processes.
(Or even a higher one.)

=== New schedclock() mechanism === Some platforms should be cutting down
stathz before hitting the scheduler, since the scheduler algorithm only
works right in the vicinity of 64 Hz. Rather than prescale hz, then scale
back and forth by 4 every time p_estcpu is touched (each occurrence an
abstraction violation), use p_estcpu without scaling and require schedhz
to be generated directly at the right frequency. Use a default stathz (well,
actually, profhz) / 4, so nothing changes unless a platform defines schedhz
and a new clock.
[ To do: Define these for alpha, where hz==1024, and nice was totally broken.]

=== Algorithm change === The nice value used to be added to the
exponentially-decayed scheduler history value p_estcpu, in _addition_ to
being incorporated directly (with greater weight) into the priority calculation.
At first glance, it appears to be a pointless increase of 1/8 the nice
effect (pri = p_estcpu/4 + nice*2), but it's actually at least 3x that
because it will ramp up linearly but be decayed only exponentially, thus
converging to an additional .75 nice for a loadaverage of one. I killed
this: it makes the behavior hard to control, almost impossible to analyze,
and the effect (~~nothing at all for the first second, then somewhat increased
niceness after three seconds or more, depending on load average) pointless.

=== Other bugs === hz -> profhz in the p_pctcpu = f(p_cpticks) calculation.
Collect scheduler functionality. Try to put each abstraction in just one
place.


# 1.19 15-Jul-1999 art

Don't destroy sysvshm if the vmspace is shared (only affects uvm)


# 1.18 23-Jun-1999 art

Improved sysv shared memory. Works with UVM.
Original work done in FreeBSD, but this code was ported from NetBSD by
Chuck Cranor.


Revision tags: OPENBSD_2_5_BASE
# 1.17 12-Mar-1999 deraadt

in scheduler, bias parents for child cpu usage; ross@ghs.com


# 1.16 02-Mar-1999 niklas

RFNOWAIT does not dissociate the child from its parent in any other
way than that the parent wait call will never get the status of this child,
says Rob


# 1.15 26-Feb-1999 art

vm allocation changes for uvm


# 1.14 11-Jan-1999 niklas

comment typo


# 1.13 10-Jan-1999 niklas

Make RFNOWAIT work in rfork(2)


Revision tags: OPENBSD_2_3_BASE OPENBSD_2_4_BASE
# 1.12 06-Nov-1997 csapuntz

Updates for VFS Lite 2 + soft update.


Revision tags: OPENBSD_2_2_BASE
# 1.11 06-Oct-1997 deraadt

back out vfs lite2 till after 2.2


# 1.10 06-Oct-1997 csapuntz

VFS Lite2 Changes


# 1.9 15-Sep-1997 millert

From FreeBSD (joerg@freebsd.org):
Implement SA_NOCLDWAIT by reparenting kids of processes that have
the appropriate bit set to PID 1, and let PID 1 handle the zombie.
This assumes that PID 1 will wait for its kids (which is true of init).
This also includes some FreeBSD sigaction.2.


Revision tags: OPENBSD_2_1_BASE
# 1.8 26-Oct-1996 tholo

Verify that options to wait4() are legal


Revision tags: OPENBSD_2_0_BASE
# 1.7 15-Aug-1996 tholo

Clear p_pctcpu when a process exits


# 1.6 02-May-1996 deraadt

sync syscalls, no sys/cpu.h


# 1.5 03-Mar-1996 niklas

From NetBSD: 960217 merge


# 1.4 30-Dec-1995 deraadt

from netbsd:
Remove the process from zombproc and its parent's child list before freeing
its resources.


# 1.3 14-Dec-1995 deraadt

from netbsd; limfree()


# 1.2 22-Nov-1995 deraadt

release text vnode before releasing credentials. vnode releasing can
block, but credentials should be alive until the process is really
dead. from tegge@idt.unit.no; netbsd pr#1767


# 1.1 18-Oct-1995 deraadt

branches: 1.1.1;
Initial revision